Test reliability and validity are determined by the quality of the
items in the tests. Through the application of item analysis procedures,
test constructors are able to obtain quantitative, objective information
useful in developing and judging the quality of a test and its items.

Classical  test  theory  forms  the  basis  for  one  method  of  test 
development.   An  integral  part  of  the  development  of  tests  based  on  the 
classical  model  is  selection  of  a  final  set  of  items  from  an  item  pool 
based  on  classical  item  analysis  or  factor  analysis.   Classical  item 
analysis  requires  identification  of  single  items  which  provide  maximum 
discrimination  between  individuals  on  the  latent  trait  being  measured. 
The biserial correlation between item score and total score is commonly
used as an index of item discrimination.

An  alternative  method  of  test  development,  but  based  on  the 
classical  model,  is  factor  analysis.   Factor  analysis  is  a  more  complex 
test development procedure than classical item analysis. It is a
statistical technique that takes into account the item correlation
with all other individual items in the test simultaneously. Thus,
classical item analysis can be viewed as a unidimensional basis for
item analysis, less sophisticated than the multidimensional procedure
of factor analysis.

Recently,  the  field  of  latent  trait  theory  has  provided  a  new  approach 
to  test  construction.   Several  latent  trait  models  have  been  developed; 
however,  this  study  was  concerned  only  with  the  one-parameter  logistic 
Rasch model. The Rasch model was chosen because it is the most parsimonious
of the latent trait models and has recently been used in the
development  and  equating  of  tests. 

A  review  of  the  literature  revealed  numerous  studies  conducted  in 
each  of  the  three  areas  of  item  analysis,  but  no  comparative  studies 
were  reported  among  all  three  item  analytic  techniques.   Therefore, 
the  present  study  was  designed  to  compare  the  methods  of  classical 
item  analysis,  factor  analysis,  and  the  Rasch  model  in  terms  of  test 
precision  and  relative  efficiency. 

An  empirical  study  was  designed  to  compare  the  effects  of  the 
three  methods  of  item  analysis  on  test  development  across  different 
sample  sizes  of  250,  500,  and  995  subjects.   Item  response  data  were 
obtained from a sample of 5,235 high school seniors on a 50-item cognitive
test of verbal aptitude. The subjects were divided into nine
independent samples, one for each item analytic technique and sample
size. The study was conducted in three phases: item selection,
computation of item and test statistics for selected items on double
cross-validation samples, and statistical analyses of item characteristics.
For each item analytic procedure two tests were developed:
a 15-item test and a 30-item test. Four dependent variables were
obtained for each test to assess precision: internal consistency
estimates, standard error of measurement, item difficulties, and item
discriminations. In addition, the relative efficiencies of the 30-item
tests developed by each item analytic technique were compared for the
sample of 995 subjects.

The results of the analysis revealed that there were no differences
between the tests developed by the three methods of item analysis
in terms of the precision of measurement. In terms of efficiency,
substantive differences between the tests produced by the three item
analytic methods were observed. Specifically, the tests based on classical
test theory were more effective for measuring very low and very
high ability students. The Rasch-developed test was more efficient for
assessing average and high ability students.


CHAPTER I
INTRODUCTION 

The systematic approach to test development was initiated by Binet
and Simon in 1916. Since that time psychometricians have been concerned
with the extent to which accurate measurement of a person's "ability"
is possible. Most measurement experts agree that upon repeated testing
an individual's observed score will vary even though his true ability
remains constant. This variability is the essence of classical test
theory.

Classical  test  theory  is  based  upon  the  assumption  that  a  person's 
observed  score  (X)  is  made  up  of  a  true  score  (T)  and  error  score  (E) 
denoted: 

X  =  T  +  E.  (1) 

Limited by few assumptions, this theory has wide applications. The few
assumptions pertain to the error score (Magnusson, 1966, p. 64):

1. The mean of an examinee's error scores on an infinite
number of parallel tests is zero.

2. The correlation between examinees' error scores on parallel
tests  is  zero. 

3.  The  correlation  between  examinees'  error  scores  and  true 
scores  is  zero. 

Relying upon these assumptions, psychometricians have used the observed
score (X) to represent the best estimate of a person's true score (T).
The accuracy of the observed score (X) in representing an examinee's
true score (T) is described by the reliability coefficient. One definition
of reliability is given by the coefficient of precision. This coefficient
is the correlation between truly parallel tests, assuming the examinee's
true score does not change between two measurements. Lord and Novick
(1968) have defined truly parallel tests to be those for which "the
expected values [true scores] of parallel measurements are equal; and the
observed score variances of parallel measurements are equal" (p. 48).

The reliability coefficient for the population is defined as
(Lord and Novick, 1968, p. 134):

r_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2},  (2)

where $\sigma_T^2$ is the true score variance, $\sigma_X^2$ is the observed
score variance, and $\sigma_E^2$ is the error score variance. When this
expression is used to
represent the coefficient of precision, it can be interpreted as the
extent to which unreliability is due solely to inadequacies of the test
form and testing procedure rather than due to changes in examinees over
time.
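
As an illustration of Equation 2 (not part of the original study), the
following minimal Python sketch simulates two truly parallel forms under
the classical assumptions and compares their correlation with the ratio
of true score variance to observed score variance. The normal score
distributions, the sample size, the variances, and all function names
are hypothetical choices made only for this sketch.

```python
import random


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_parallel_forms(n_examinees, sd_true, sd_error, seed=0):
    """Generate observed scores X = T + E on two parallel forms.

    Assumption of this sketch: true and error scores are normal, and the
    errors are drawn independently of each other and of the true scores.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(25.0, sd_true) for _ in range(n_examinees)]
    form_a = [t + rng.gauss(0.0, sd_error) for t in true_scores]
    form_b = [t + rng.gauss(0.0, sd_error) for t in true_scores]
    return form_a, form_b


SD_TRUE, SD_ERROR = 3.0, 2.0
# Equation 2: true score variance over observed score variance.
theoretical = SD_TRUE ** 2 / (SD_TRUE ** 2 + SD_ERROR ** 2)
form_a, form_b = simulate_parallel_forms(20000, SD_TRUE, SD_ERROR)
empirical = pearson(form_a, form_b)
print(f"variance ratio: {theoretical:.3f}")
print(f"correlation between parallel forms: {empirical:.3f}")
```

With a large simulated sample the two printed values should agree
closely, which is the sense in which the coefficient of precision is the
correlation between truly parallel tests.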

The coefficient of precision is a theoretical value because the
components $\sigma_T^2$ and $\sigma_E^2$ cannot be observed. The coefficient
of precision
is usually estimated by internal consistency methods. Internal consistency
is a measure of the relationship between random parallel tests.
Random parallel tests are composed of items drawn from the same population
of items (Magnusson, 1966, pp. 102-105). Scores on these tests may
differ somewhat from true scores in means, standard deviations, and
correlations because of random errors in the sampling of items. However,
random parallel tests are more often encountered in practice than are
truly parallel tests. Cronbach's coefficient alpha (1951) is the
internal consistency coefficient commonly used to represent the average
correlation among all possible tests created by dividing the domain
into random halves. Thus, the internal consistency coefficient indicates
the extent to which all the items are measuring the same ability or trait.
Psychological traits are often described as latent because they cannot
be directly observed. Therefore, psychological tests are developed in
an attempt to measure these latent traits.
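
The computation of coefficient alpha can be sketched as follows. This is
a minimal illustration using the standard formula, alpha = [k / (k - 1)]
[1 - (sum of item variances) / (total score variance)], which is not
spelled out in the text above; the function names and the small response
matrix are hypothetical and are not data from this study.

```python
def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def cronbach_alpha(responses):
    """Coefficient alpha for a matrix of item scores.

    responses: one row per examinee, one column per item.
    """
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("alpha requires at least two items")
    item_variances = [
        variance([row[i] for row in responses]) for i in range(n_items)
    ]
    total_variance = variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 0/1 responses of five examinees to four items.
example = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(example)
print(f"coefficient alpha = {alpha:.2f}")
```

For dichotomously scored items such as those in the example, the item
variance reduces to the product of the proportions passing and failing
the item.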

Classical test theory forms the basis for one method of test
development. An integral part of the development of tests based on the
classical model is the utilization of classical item analysis or factor
analysis. Classical item analysis is a procedure to obtain a description
of the statistical characteristics of each item in the test. This
approach requires identification of single items which provide maximum
discrimination between individuals on the latent trait being measured.
Theoretically, selecting items which have high correlations with total
test score will result in a discriminating test which is homogeneous
with respect to the latent trait. Therefore, classical item analysis
is an aid to developing internally consistent tests.

An alternative method of test development, but based on the classical
model, is factor analysis. Factor analysis is a more complex test
development procedure than classical item analysis. It is a statistical
technique that takes into account the item correlation with all other
individual items in the test simultaneously. Groups of similar items
tend to cluster together and comprise the latent traits (factors) underlying
the test. Under the classical model, then, classical item analysis
can be viewed as a unidimensional basis for item analysis, less
sophisticated than the multidimensional procedure of factor analysis.


The purpose of factor analysis is to represent a variable in terms
of one or several underlying factors (Harman, 1967). Depending upon the
objective of the analysis, two general approaches are used in
factor analysis: (a) common factor analysis, and (b) principal components
analysis. A common factor solution would be warranted if the
researcher were interested in determining the number of common and
unique factors underlying a given test. A principal component solution
would be warranted if it were of interest to extract the maximum
amount of variance from a given test.

Regardless of the approach used, factor analysis is an item analytic
technique in which all test items are considered simultaneously to produce
a matrix of item correlations with factors. It is these correlations,
or item loadings, that indicate the strength of the factor and also the
number of factors underlying the test. However, factor analysis shares
the weakness of classical item analysis, that of being sample dependent.


Critics  of  classical  test  theory  contend  that  a  major  weakness  of 
tests  developed  from  this  model  is  that  the  item  statistics  vary  when 
the  examinee  group  changes;  item  statistics  may  also  vary  if  a  different 
set  of  items  from  the  same  domain  is  used  with  the  same  examinee  group 
(Hambleton and Cook, 1977; Wright, 1968). Thus, the selection of a final
set  of  test  items  will  be  sample  dependent. 

Until recently, classical item analysis and factor analysis were the
only techniques described in measurement texts for use in item analysis
and test development (Baker, 1977). However, with the publication of
Lord and Novick's Statistical Theories of Mental Test Scores (1968) and
the availability of computer programs, considerable attention is now being
directed toward the field of latent trait theory as a new area in
test development. Latent trait theory dates back to Lazarsfeld (1950),
who introduced the concept; however, Frederic Lord is generally given
credit as the father of latent trait theory (Hambleton, Swaminathan,
Cook, Eignor, and Gifford, 1977). Proponents of this approach claim
that the advantages of latent trait theory over classical test theory
are twofold: (a) theoretically it provides item parameters which are
invariant across examinee samples that differ with respect to
the latent trait, and (b) it provides item characteristic curves that
give insight into how specific items discriminate between students of
varying abilities. These properties of latent trait theory will be
presented in more detail in Chapter II.

Four latent trait models have been developed for use with
dichotomously scored data: the normal ogive, and the one-, two-, and
three-parameter logistic models (Hambleton and Cook, 1977; Lord and Novick,
1968). This study is concerned with the one-parameter logistic Rasch
model because it is the simplest of the four models.

Tests developed using the Rasch model are intended to provide
objective measurement of the examinee's true ability on the latent
trait in question, as well as providing for invariant item parameters
(Rasch, 1966; Wright, 1968). That is, any subset of items from a
population of items that have been calibrated by the Rasch model should
accurately measure the examinee's true ability regardless of whether the
items are very easy or very difficult; also, the item parameters should
remain constant over different examinees. In measurements obtained from
classical test theory this objective feature is rarely attained. The
item parameters associated with classical test theory are group and
item specific. That is, the item parameters are determined by the
ability of the people taking the test and the subset of items chosen.
Wright (1968) has stated, "The growth of science depends on the development
of objective methods for transforming an observation into measurement"
(p. 86). Latent trait theory is an attempt to develop mental
measurement into a technique similar to measurement in the physical
sciences.

Latent trait theory is based on strong assumptions that are restrictive
and hence limit its application (Hambleton and Cook, 1977). The
assumptions required for the Rasch model are the following (Rasch, 1966):

1. The test is unidimensional, i.e., there is only one factor
or trait underlying test performance.

2. The item responses of each examinee are locally independent,
i.e., success or failure on one item does not affect responses to other
items.

3. The item discriminations are equal, i.e., all items load
equally on the factor underlying the test.

Lord and Novick (1968) noted that the assumptions of unidimensionality
and local independence are synonymous. To say that only one underlying
ability is being tested means the items are statistically independent
for persons at the same ability level. The third assumption relates to
item characteristic curves. The item characteristic curve is a mathematical
function that relates the probability of success on an item to the
ability measured by the test. Curves vary in slope and intercept to
reflect how items vary in discrimination and difficulty. The one-parameter
logistic Rasch model (the one parameter is item difficulty) assumes
all item discriminations are equal. Thus all item characteristic
curves should be similar with respect to their slopes.
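
The equal-slope property can be illustrated with a short Python sketch.
It assumes the usual logistic form of the one-parameter model, in which
the probability of success depends only on the difference between
ability and item difficulty; the ability values, the two item
difficulties, and the function names are hypothetical and serve only as
an example.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def logit(probability):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(probability / (1.0 - probability))


abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical ability levels
easy_item, hard_item = -1.0, 1.0          # hypothetical item difficulties

for theta in abilities:
    p_easy = rasch_probability(theta, easy_item)
    p_hard = rasch_probability(theta, hard_item)
    # With equal discriminations the two curves are parallel on the
    # log-odds scale, so this gap is the same at every ability level.
    gap = logit(p_easy) - logit(p_hard)
    print(f"ability {theta:+.1f}: P(easy) = {p_easy:.3f}, "
          f"P(hard) = {p_hard:.3f}, log-odds gap = {gap:.3f}")
```

The constant log-odds gap equals the difference between the two item
difficulties, so under this model items differ only in where their
curves are located along the ability scale.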


The  Problem 
Several studies have been conducted to verify the invariant properties
of tests constructed using the Rasch model (Tinsley and Dawis,
1975; Whitely and Dawis, 1974; Wright, 1968). If we assume that tests
developed using latent trait theory possess the quality of invariant
item statistics, why then hasn't latent trait theory been more visible
in the psychometric community? There appear to be three main reasons
for this slow acceptance. First, the Rasch procedure is based on a
mathematical model involving restrictive assumptions, namely the
unidimensionality of the items, the local independence of the items, and
equal item discriminations. A further restriction of the Rasch model
is the assumption of minimal guessing. However, several researchers
have demonstrated the robustness of the model with regard to departures
from the basic assumptions (Anderson, Kearney and Everett, 1968; Dinero
and Haertel, 1976; Rentz, 1976). Second, latent trait theory has not been
used in practical testing situations because until recently there was a
lack of available computer programs to handle the complex mathematical
calculations. Hambleton et al. (1977) described four computer programs
now available to the consumer. Third, measurement experts who are
knowledgeable about latent trait models have been skeptical as to the
real gains that may be available through this line of research. Are
tests developed using latent trait models superior to tests developed
using classical item analysis or factor analysis?

The purpose of this study was to compare the precision and efficiency
of cognitive tests constructed by the three methods (classical item
analysis, factor analysis and the Rasch model) from a common item and
examinee population. Precision, as measured by internal consistency,
is an overall estimate of a test's homogeneity, but provides no information
on how the test as a whole discriminates for the various ability
groups taking the test. For that reason measures of test efficiency
(Lord, 1974a, 1974b) were incorporated into the study. Test efficiency
provides information on the effectiveness of one test over another as
a function of ability level. A cognitive college admissions subtest was
used in this study for several reasons. First, tests of this type are
widely used by educational institutions for a large number of examinees
each year, in the areas of selection, placement, and academic counseling.
Most college admission examinations traditionally have been developed
using classical item analysis. Second, because of the importance of
the decisions made using such test scores, it would be worth investing
considerable time and expense in the development of these instruments.
Thus, the use of factor analysis or the Rasch model would be justified
if superiority of either of these methods over classical item analysis
could be determined. Third, the items on college admission tests have
been written by experts, and each subtest is intended to be unidimensional,
i.e., its items measure a single ability. Thus, assumptions from all
models should be met. Fourth, because of the time required to take such
examinations, it is important to maximize the precision and the effectiveness
of the tests. The possibility of using fewer items while maintaining
precision would be desirable. Therefore, the question of which
test development procedure can best accomplish this is not a trivial one.

Purpose  of  the  Study 

The  purpose  of  this  study  was  to  compare  empirically  the  Rasch  model 
with  classical  item  analysis  and  factor  analysis  in  test  development. 
Five research questions guided this study.

1. Will the three methods of test development produce tests with
superior internal consistency estimates when compared to the projected
internal consistency of the population as the number of items decreases?

2.  Will  the  three  methods  of  test  development  produce  tests 
with  stable  estimates  of  internal  consistency  when  the  number  of 
examinees  decreases? 

3. Will the three methods of test development produce tests
with  similar  standard  errors  of  measurement? 

4.  Will  the  three  methods  of  test  development  select  items  that 
are  similar  in  terms  of  difficulty  and  discrimination? 

5.  Will  the  three  methods  of  test  development  produce  equally 
efficient  tests  for  all  ability  levels? 

Hypotheses 

This  study  investigated  the  capacities  of  three  methods  of  test 
development  to  increase  precision  and  efficiency  of  measurement  in  test 
construction. The five questions posited in the previous section were
phrased as testable hypotheses:

1. There are no significant differences in the internal consistency
estimates of the tests produced by the three methods, as the number of items
decreases, when compared to the projected internal consistency estimates
for the population for tests of similar length.


2. There are no differences in the internal consistency estimates
of the tests produced by the three methods when the number of examinees
is decreased.

3. There are no "meaningful"² differences in the magnitude of the
standard error of measurement of the tests produced by the three
methods.

4. There are no significant differences in the difficulties or
discriminations of the items selected by the three methods.

5. There are no differences across ability levels in the efficiency
of the tests produced by the three methods.

The standard error of measurement (SEM) is defined in the classical
sense as (Magnusson, 1966, p. 79):

SEM = s_X \sqrt{1 - r_{XX'}},

where $s_X$ is the standard deviation of the test, and $r_{XX'}$ is the
reliability coefficient.
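
The standard error of measurement defined above can be computed directly
from a test's standard deviation and reliability, as in the following
minimal Python sketch. The numerical values and function names are
hypothetical. The 1.00 threshold follows the study's footnoted definition
of a "meaningful" difference, read here (as an assumption) as an absolute
difference of at least 1.00.

```python
import math


def standard_error_of_measurement(sd_total, reliability):
    """SEM = s_X * sqrt(1 - r_XX'), the classical definition."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd_total * math.sqrt(1.0 - reliability)


def is_meaningful_difference(sem_a, sem_b, threshold=1.0):
    """True when two SEMs differ by at least the threshold.

    Assumption of this sketch: the footnoted criterion is applied to the
    absolute difference between the two standard errors.
    """
    return abs(sem_a - sem_b) >= threshold


# Hypothetical tests with the same standard deviation but different reliabilities.
sem_high = standard_error_of_measurement(5.0, 0.84)
sem_mid = standard_error_of_measurement(5.0, 0.75)
sem_low = standard_error_of_measurement(5.0, 0.36)
print(f"SEM at r = .84: {sem_high:.2f}")
print(f"SEM at r = .75: {sem_mid:.2f}")
print(f"SEM at r = .36: {sem_low:.2f}")
print("r = .84 vs r = .75 meaningful?", is_meaningful_difference(sem_high, sem_mid))
print("r = .84 vs r = .36 meaningful?", is_meaningful_difference(sem_high, sem_low))
```

Because score points are the unit of the SEM, a fixed threshold in score
points ties the comparison to how scores are reported rather than to a
significance test.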

Significance  of  the  Study 
Objective measurement has always been assumed in the physical
sciences. Only recently, with the advent of latent trait theory, has
objective measurement in the behavioral sciences been deemed possible.
Since the introduction of latent trait theory by
Lazarsfeld (1950) and Lord (1952a, 1953a, 1953b), much of the research
on latent trait models has been confined to theoretical research journals.
Wright (1968), speaking at a conference on testing problems, discussed
at an applied level the need to seriously consider latent trait theory,
and the Rasch model in particular, as a major test development technique
far superior to classical item analysis and factor analysis. However,
even in 1968 computer programs were not yet available to run the analyses
should anyone beyond academicians be interested. Today this obstacle
has been overcome, but many test developers remain unconvinced of the
value of latent trait theory because its superiority to classical test
theory has not been conclusively demonstrated. This study is an attempt
to provide an empirical comparison of classical test theory and latent
trait theory methods of test construction.

² Because test scores are usually reported and interpreted in
whole numbers, a "meaningful" difference in the standard error of
measurement is defined as a difference of 1.00 or more.

Of the various logistic models that represent latent trait theory,
the Rasch model was chosen for comparison with traditional item
analysis procedures in the present study because it is the most
parsimonious latent trait model and has been used recently in the
development and equating of tests (Rentz and Bashaw, 1977; Woodcock,
1974). The Rasch model provides a mathematical explanation for the
outcome of an event when an examinee attempts an item on a test. Rasch
(1966) stated that the outcome of an encounter is governed by the product
of the ability of the examinee and the easiness of the item and
nothing more. The implication of this simple concept (objectivity of
measurement) would seem to revolutionize mental measurement. If invariant
properties of items and ability scores can be identified and used to
improve the psychometric quality of tests to an extent greater than now
possible with classical and factor analytic procedures, then we truly
are in the age of modern test theory.

Organization  of  the  Study 

The theoretical and empirical studies related to the three methods
of item analysis are described in Chapter II. An empirical investigation
to compare the three methods of item analysis under varying conditions
is described in Chapter III. The results of the study are reported in
Chapter IV. A discussion of the results, conclusions of the study, and
implications for future research in this area are presented in
Chapter V. A summary of the study is provided in
Chapter VI.


CHAPTER II
REVIEW OF THE LITERATURE

The quality of the items in a test determines its validity and
reliability. Through the application of item analysis procedures, test
constructors are able to obtain quantitative, objective information
useful in judging the quality of test items. Item analysis thus provides
an empirical basis for revising the test, indicating which items
can be used again and which items have to be deleted or rewritten
(Lange, Lehmann, and Mehrens, 1967). Item analysis data also help
settle arguments and objections to specific items that might be raised
by administrators, test experts, examinees, or the public.

This study is focused on three approaches to item analysis (classical
item analysis, factor analysis, and the Rasch model) as test construction
techniques. It is assumed throughout this study that the test under
construction is unidimensional, i.e., all items are measuring only one
ability. These three approaches to item analysis and the relevant
research related to each method are discussed in this chapter.

Item Analysis Procedures for the Classical Model

Item analysis as a test development technique emerged at the beginning
of this century. Binet and Simon (1916) were among the first to
systematically validate test items. They noted the proportion of
students at particular age levels passing an item. This statistic was
measuring the relative difficulty of the items for different age groups.
The item difficulty index, defined as the proportion of persons passing
an item and denoted by p, is one of the statistics used in classical
item analysis.

Item difficulty is related to item variance and hence to the
internal consistency of the test. Test constructors are usually concerned
with achieving high test reliability, i.e., precision of measurement.
Therefore, an item difficulty of .50 is considered to be the ideal
value necessary to maximize test reliability. This is because half
the examinees are getting the item correct and half the examinees are
missing the item. The proportion missing an item is defined as 1 - p or
q. Thus, when p is equal to .50, q is equal to .50. Because the
variance of a dichotomized item is p x q, the maximum variation an item
can contribute to total test variance and ultimately to true-score
variance is .25. As an item's difficulty index deviates from .50, its
contribution to total test variance is always some value less than .25.
Hence test constructors have been advised (Gulliksen, 1945) to select
items with difficulty indices at or near .50. However, when items
are presented in multiple choice or alternate choice format, the ideal
level of difficulty is adjusted to allow for guessing.³
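
The arithmetic in this paragraph can be checked with a few lines of
Python. The sketch below is illustrative only: the function names are
hypothetical, and the guessing adjustment assumes that examinees who do
not know the answer (half of them at the ideal difficulty) guess at
random among the options, which is one reading of the footnoted
correction.

```python
def item_difficulty(item_scores):
    """Proportion of examinees passing the item (p), for 0/1 scores."""
    return sum(item_scores) / len(item_scores)


def item_variance(p):
    """Variance of a dichotomously scored item: p * q, with q = 1 - p."""
    return p * (1.0 - p)


def guessing_adjusted_ideal_difficulty(n_options, p_know=0.50):
    """Ideal p when those who do not know the answer guess at random.

    Assumption of this sketch: the proportion (1 - p_know) guesses, and
    each guess is correct with probability 1 / n_options.
    """
    return p_know + (1.0 - p_know) / n_options


for p in (0.10, 0.30, 0.50, 0.70, 0.90):
    print(f"p = {p:.2f}: item variance = {item_variance(p):.2f}")

# Four-option multiple choice item: .50 + (1/4)(.50).
print(f"adjusted ideal p = {guessing_adjusted_ideal_difficulty(4):.3f}")
```

The item variance peaks at p = .50 and falls off symmetrically on either
side, which is why items near that value contribute most to total test
variance.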

A second important item statistic in classical item analysis is
the item discrimination index. An item discrimination index is a measure
of how well the item discriminates between persons who have high test
scores and persons who have low test scores. The discrimination index
is often expressed as a correlation between the item and total test
score. When the criterion is total test score, the correlation coefficient
indicates the contribution that item makes to the test as a
whole. Thus, on tests of academic achievement it is a measure of item
validity as well as a contributor to internal consistency. Noting an
increasing use of item analytic procedures for the improvement of
objective examinations, Richardson (1936) pointed out that the
development of the procedures of item analysis had centered primarily
around the invention of various indices of association between the test
item and the total test score, i.e., item discrimination indices.

³ The ideal value of p = .50 assumes there has been no guessing on
the item. The effect of guessing on item difficulty tends to increase
the ideal value of p. For example, on a four-option multiple choice
item the chance of guessing the correct answer is (1/4)(.50) = .12. The
value of .12 is added to .50 to correct for the effect of guessing and
the ideal p would now be .62 (Lord, 1952b; Mehrens and Lehmann, 1975).

The two most popular item-test correlation indices are the biserial
and point biserial correlations. The point biserial was developed by
Pearson (1900) and is a special case of the more general Pearson Product
Moment (PPM) correlation coefficient (Magnusson, 1966). This index
is recommended when one of the variables being correlated (the item
score) represents a true dichotomy and the other variable (total test
score) is continuously distributed. Pearson (1909) also derived the
biserial correlation, which is an estimate of the PPM. The biserial
correlation is recommended when one of the variables (the item score)
has an underlying continuous and normal distribution which has been
artificially dichotomized and the other variable (total test score) is
continuously distributed. The assumption for the point biserial
correlation is often hard to justify when it is suspected that knowledge
required to answer an item is continuously distributed.




In considering the dichotomized item (pass/fail), McNemar (1962)
has commented, "It is obvious that failing a test item represents
anything from a dismal failure up to a near pass, whereas passing the
item involves barely passing up to passing with the greatest of ease"
(p. 191). Thus, the biserial correlation is usually favored over the
point biserial correlation as a measure of item discrimination. Also,
the biserial is often chosen over the point biserial because the
magnitude of the point biserial correlation for an item is not
independent of the item difficulty (Davis, 1951; Henrysson, 1971;
Swineford, 1956). Specifically, values of the point biserial are
systematically depressed as p approaches the extremes of .00 or 1.00.
Lord and Novick (1968) have pointed out that because of this bias, the
point biserial correlation tends to favor medium difficulty items over
easy or very difficult items.

The formulae for the biserial and point biserial correlation,
respectively, are (Magnusson, 1966, pp. 200 & 203):

$$ r_{bis} = \frac{\bar{X}_p - \bar{X}_q}{s_y} \cdot \frac{pq}{Y} \qquad (3) $$

$$ r_{pbis} = \frac{\bar{X}_p - \bar{X}_q}{s_y} \sqrt{pq} \qquad (4) $$

where $\bar{X}_p$ is the mean of y scores for persons who correctly
solved the item, $\bar{X}_q$ is the mean of y scores for persons who
incorrectly solved the item, $s_y$ is the standard deviation of the y
test scores, p and q have been previously defined, and Y is the ordinate
of the dividing line between the proportions p and q in a unit normal
distribution (Magnusson, 1966).
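
As a rough illustration of the two indices, the sketch below computes
both from a small set of made-up responses. It assumes the standard
textbook forms of the two coefficients (mean difference over the total
score standard deviation, scaled by sqrt(pq) or by pq/Y); the function
name and the data are hypothetical and are not taken from the study.

```python
from statistics import NormalDist, mean, pstdev

def item_total_correlations(item, total):
    """Return (point biserial, biserial) for 0/1 item scores and total scores."""
    right = [y for x, y in zip(item, total) if x == 1]
    wrong = [y for x, y in zip(item, total) if x == 0]
    p = len(right) / len(item)          # proportion answering correctly
    q = 1.0 - p
    s_y = pstdev(total)                 # population SD of total scores
    diff = (mean(right) - mean(wrong)) / s_y
    # Y: ordinate of the unit normal curve at the point dividing p from q
    Y = NormalDist().pdf(NormalDist().inv_cdf(p))
    r_pbis = diff * (p * q) ** 0.5
    r_bis = diff * p * q / Y
    return r_pbis, r_bis

# Hypothetical data: ten examinees, one item, and their total scores.
item = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
total = [9, 8, 7, 3, 4, 6, 2, 8, 5, 1]
r_pbis, r_bis = item_total_correlations(item, total)
print(round(r_pbis, 3), round(r_bis, 3))
```

With the population standard deviation, the point biserial computed this
way coincides with the ordinary product moment correlation between the
0/1 item score and the total score, and the biserial is always the larger
of the two in absolute value.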

One of the main objectives of classical test theory is to improve
the internal consistency of the test under construction, where internal
consistency is defined as the extent to which all items are measuring
the same ability. To ensure high internal consistency the random error
in the test must be minimized. As stated previously in Equation 2,
reliability, in the classical model, was defined as:

$$ r_{xx} = \frac{\sigma_t^2}{\sigma_x^2} = 1 - \frac{\sigma_e^2}{\sigma_x^2} $$

Thus, the relationship among the test items can be noted in the
coefficient alpha formulae for estimating internal consistency for a
sample (Magnusson, 1966, pp. 116-117):

$$ r_{xx} = \frac{n}{n-1}\left(1 - \frac{\sum S_i^2}{S_x^2}\right) \qquad (5) $$

where n is the number of test items, $\sum S_i^2$ is the sum of the item
variances, $S_x^2$ is the variance of the test, and $\bar{C}_{ik}$ is the
mean of the item covariances. By comparing equation 2 with 5, it is seen
that the sum of the unique item variances is used as an estimate of
$\sigma_e^2$, and that when the unique item variation is minimized
internal consistency will be high. Furthermore, the mean of the item
covariances (equation 6) serves as an estimate of $\sigma_t^2$. The size
of the covariance term is in turn determined by the intercorrelations and
standard deviations of the items (Magnusson, 1966). Therefore, internal
consistency is directly dependent upon the correlation among the items
in the test.
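
The dependence of coefficient alpha on the item covariances can be
checked numerically. The sketch below computes alpha from the item
variances and again from the mean item covariance, using the identity
that the test variance equals the sum of the item variances plus
n(n - 1) times the mean covariance. The covariance form used here is one
algebraically equivalent expression and is an assumption; it is not
necessarily the form in which the source states its covariance
equation. The response matrix is hypothetical.

```python
from statistics import mean, pvariance

def covariance(a, b):
    """Population covariance of two equally long score lists."""
    ma, mb = mean(a), mean(b)
    return mean([(x - ma) * (y - mb) for x, y in zip(a, b)])

def coefficient_alpha(scores):
    """Alpha from item variances; `scores` is a persons-by-items matrix."""
    items = list(zip(*scores))
    n = len(items)
    totals = [sum(person) for person in scores]
    sum_item_var = sum(pvariance(col) for col in items)
    return n / (n - 1) * (1 - sum_item_var / pvariance(totals))

def alpha_from_mean_covariance(scores):
    """Same quantity written in terms of the mean item covariance."""
    items = list(zip(*scores))
    n = len(items)
    totals = [sum(person) for person in scores]
    covs = [covariance(items[i], items[k])
            for i in range(n) for k in range(n) if i != k]
    return n * n * mean(covs) / pvariance(totals)

# Hypothetical responses: six examinees by four dichotomous items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha_a = coefficient_alpha(scores)
alpha_b = alpha_from_mean_covariance(scores)
print(round(alpha_a, 4), round(alpha_b, 4))
```

The two values agree, which makes concrete the statement that internal
consistency rises and falls with the covariances among the items.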

The item discrimination index provides a measure of how well an
item contributes to what the test as a whole measures. When items with
the highest item-test correlations are selected, the homogeneity of
the test is increased; that is, $\sigma_t^2$ is increased. So it is the
item discrimination that directly affects test reliability. When items
with low item-test correlations are eliminated, the remaining item
intercorrelations are raised. When item-test correlations are high, the
test is able to discriminate between high and low scorers and hence
internal consistency is increased. If too few items are discarded in an
item analysis the internal consistency of the test tends to decrease
because items with little power of measuring what the entire test is
intended to measure will dilute the measuring power of the efficient
items (Beddell, 1950).

Research Related to Classical Item Analysis in Test Development

Several articles have been published concerning standards for item
selection to maximize test validity and increase internal consistency.
Flanagan (1959) stated two considerations in selecting test items:
(a) the item must be valid, that is, it should discriminate between
high and low scorers, and (b) the level of item difficulty should be
suitable for the examinee group. Gulliksen (1945) agreed with Flanagan
on these two points and added a third: items selected with p = .50 would
produce the most valid tests. However, Gulliksen noted that current
practice was opposed to selecting items with difficulty near .50. Test
developers were selecting items based upon spreading difficulty indices
over a broad range.

Several studies have been conducted to examine the effects of
varying item difficulty on test development. Brogden (1946), in a
study of test homogeneity, has shown empirically that a test of 45
items with varying levels of item difficulty produced a reliability
of .96 (measured by the Kuder-Richardson formula). However, a
similar but longer test of 153 items, that had item difficulties at
.50 for all items, produced reliability of .99. Thus, Brogden
concluded that effective item selection was based more on selecting a
test with fewer items that possessed varying difficulty, than a longer
test with equal item difficulty.

Davis (1951), in commenting on item difficulty, stated that if
all test items had a difficulty of .50 and were uncorrelated then
maximum discrimination was achieved. But when test items were
correlated, maximum discrimination would only be achieved when the
difficulty index for all test items was spread out, e.g., several
difficult items, several easy items, and several items with difficulty
near .50. Davis recommended the latter procedure for test development
because test items are usually correlated to some degree. Davis also
recognized the need for the approval of subject matter specialists in
addition to statistical criteria in item selection.

In a study of test validity, Webster (1956) found results similar
to Brogden (1946), but different from Gulliksen (1945). By selecting
fewer items with high discrimination indices and varying item difficulty
levels, a more valid test was produced. Webster's results indicated
that a test of 178 items with difficulty indices near .50 had a validity
coefficient of .66. However, a test of 124 similar items with varying
item difficulties had a validity coefficient of .76, statistically
significant at p < .03 (based on r to z transformations).

Myers (1962), concerned by the current practice of selecting
items based on varying item difficulties instead of the theoretical
ideal of p = .50, compared the effect of the current practice to the
theoretical ideal on reliability and validity of a scholastic aptitude
test. The ideal item difficulty ranged from .40 to .74 in what he
called the peaked test. Items selected by the current practice were
outside the above range, and Myers called this the U-shaped test.
Two sets of items were selected for the peaked test and the U-shaped
test, four tests in all. Myers reported no statistically significant
differences in test validity when the different tests were correlated
with freshman grades. The difference in test reliability was
statistically significant at p < .02 (using the Wilcoxon matched pairs
sign test) in favor of the peaked test. The reliability of the peaked
test was .69. The reliability of the U-shaped test was .65. The author
noted that the results above were based on a 24 item test, and that when
test length was projected to 48 items (via the Spearman-Brown Prophecy
Formula) there were no significant differences in test reliability. The
studies of Brogden (1946) and Webster (1956) indicate that selecting
items of varying item difficulty tends to increase internal consistency
and test validity. The results from Myers' (1962) study indicated just
the opposite, that item difficulty near .50 produced the more internally
consistent test. But this was only true for a relatively short test
of 24 items; when the test length was projected to 48 items,
there were no differences in the reliability of either test based upon
the two methods of selecting items.
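
The projection of reliability to a longer test can be sketched with the
usual Spearman-Brown formula, r_k = k r / (1 + (k - 1) r), where k is
the factor by which the test is lengthened. The example below applies it
to the two 24 item reliabilities quoted above; the function name is
hypothetical, and the projected values it prints are only an
illustration, since the projected reliabilities themselves are not
reported in this review.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by `length_factor`."""
    return (length_factor * reliability
            / (1.0 + (length_factor - 1.0) * reliability))

# Doubling a 24 item test to 48 items corresponds to a length factor of 2.
peaked = spearman_brown(0.69, 48 / 24)
u_shaped = spearman_brown(0.65, 48 / 24)
print(round(peaked, 3), round(u_shaped, 3))
```

Because the formula is monotone in the starting reliability, lengthening
both tests raises both values without changing their order.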

Simplified Methods of Obtaining Item Discriminations

A second major group of articles on classical item analysis has
dealt with simplified methods of obtaining indices of item discrimination.
Because of the lack of computers in the early years of test
development, many psychometricians concerned themselves with devising
tables to provide quick estimates of item discrimination. Kelley (1939)
found that in the computation of item discrimination only 54 percent of
the examinee group (based on total test score) needed to be used.
Considering the top 27 percent and the bottom 27 percent of the test
scorers resulted in a considerable savings in computational time.
Flanagan (1939) developed a table of item discriminations to estimate
the PPM correlation between item and test score based on Kelley's
extreme score groups of top and bottom 27 percent.
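
A minimal sketch of the extreme-groups idea follows. It forms the top
and bottom 27 percent on total score and compares the proportion passing
the item in each group. The simple difference in proportions used here
is an assumption made for illustration; it is not the tabled correlation
estimate that Flanagan developed, and the function name and data are
hypothetical.

```python
def upper_lower_discrimination(item, total, fraction=0.27):
    """Proportion correct in the top group minus that in the bottom group."""
    n = len(total)
    k = max(1, round(n * fraction))                  # size of each extreme group
    order = sorted(range(n), key=lambda i: total[i])  # indices, low to high score
    low, high = order[:k], order[-k:]
    p_low = sum(item[i] for i in low) / k
    p_high = sum(item[i] for i in high) / k
    return p_high - p_low

# Hypothetical data: ten examinees with total scores 1 through 10.
total = list(range(1, 11))
sharp_item = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]   # passed mostly by high scorers
flat_item = [1] * 10                           # passed by everyone
d_sharp = upper_lower_discrimination(sharp_item, total)
d_flat = upper_lower_discrimination(flat_item, total)
print(d_sharp, d_flat)
```

Only the two extreme groups enter the calculation, which is the source
of the computational savings described above.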

Fan (1952) developed a table for the estimation of the tetrachoric
correlation coefficient using the upper and lower 27 percent of the
scorers. The tetrachoric correlation is similar to the biserial
correlation, where the correlation is between two variables, which are
assumed to have a normal and continuous underlying distribution, but
have been artificially dichotomized.

Guilford (1954) presented several short cut tabular and graphic
solutions for estimating various types of correlation coefficients to
measure test item validity. These methods result in saving a
considerable amount of time when one is forced to use hand calculations.
Today these short cut methods can be used by classroom teachers who often
do not have the aid of calculators or computers. However, many test
constructors still use these classical methods of item analysis even
though computers are available with which more sophisticated item
analytic techniques such as factor analysis or latent trait models
can be used.

Item Analysis Procedures for the Factor Analytic Model

Charles Spearman (1904) proposed a theory of measurement based on
the idea that every test was composed of one general factor and a
number of specific factors. In order to test his idea Spearman developed
the statistical procedure known as factor analysis.

"Factor  analysis  is  a  method  of  analyzing  a  set  of 
observations  from  their  intercorrelations  to  determine 
whether  the  variations  represented  can  be  accounted  for
adequately  by  a  number  of  basic  categories  smaller  than 
that  with  which  the  investigation  started"  (Fruchter, 
1954,  p.  1). 

Factor analysis is a mathematical procedure which produces a linear
representation of a variable in terms of other variables (Harman, 1967).
In the case of test items being factor analyzed, a matrix of item
intercorrelations is obtained first. Subsequently, the matrix of item
correlations is submitted to the factoring process. There are two
basic alternatives within the framework of factor analysis for analyzing
a set of data: common factor analysis, based on the work of Spearman
and later Thurstone (1947); and principal components, developed by
Hotelling (1953). The major distinction between the two methods relates
to the amount of variance analyzed, i.e., the values placed in the
diagonal of the intercorrelation matrix. Factoring of the correlation
matrix with unities in the diagonal leads to principal components, while
factoring the correlation matrix with communalities⁴ in the diagonal
leads to common factor analysis (Harman, 1967). If it is of interest
to know what the test items share in common, a common factor solution
is warranted. But if it is of interest to make comparisons to other
tests or other test development procedures, a principal components
solution is warranted. Since the present study was initiated to
compare three different test development techniques, a principal
components solution was used in this study to analyze the data under the
factor analytic model.

⁴ The communality ($h_j^2$) of a variable is defined as the sum of the
squared factor loadings, $h_j^2 = a_{j1}^2 + a_{j2}^2 + \dots + a_{jn}^2$
(Harman, 1967, p. 17); see formula 8.

The linear model for the principal components procedure is
defined as (Harman, 1967, p. 15):

$$ Z_{ji} = a_{j1}F_1 + a_{j2}F_2 + \dots + a_{jn}F_n \qquad (7) $$

$Z_{ji}$ is the variable (item) of interest, and $a_{j1}$ is the
coefficient, more frequently referred to as the loading, of variable
$Z_{ji}$ on component $F_1$. An important feature of principal components
is that the extracted components account for the maximum amount of
variance from the original variables. Each principal component extracted
is a linear combination of the original variables and is uncorrelated
with subsequent components extracted. Thus, the sum of the variances of
all n principal components is equal to the sum of the variances of the
original variables (Harman, 1967). According to Guertin and Bailey
(1970), the principal components solution was designed basically for
prediction, hence the need to use the maximum amount of variance in a
set of variables.

Since factor analysis is based upon a matrix of intercorrelations,
it is important that care be taken in selecting the appropriate
coefficient. Several item coefficients are available: phi, phi/phi
max, and the tetrachoric correlation coefficient. Carroll (1961)
pointed out several problems concerning the choice of a correlation
coefficient to be used in factor analysis. The phi coefficient (used
where both variables are true dichotomies) was found to be affected
by disparate marginal distributions and often underestimated the PPM.
The phi/phi max coefficient was developed to correct for the
underestimation of phi, but the correction is not enough to counter the
effect of extreme dichotomizations. Carroll recommended the tetrachoric
coefficient as being the least biased by extreme marginal splits,
provided the variable under consideration was normally distributed in
the population. Wherry and Winer (1955) had reached conclusions similar
to Carroll's, but went on to say that when the normality assumption was
met and the regression of test score on the item was linear the PPM
and tetrachoric are identical. The tetrachoric correlation was used in
the present study to obtain item intercorrelations.
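
The effect of disparate marginal distributions on phi can be seen in a
small numerical sketch. It uses the standard fourfold-table formula for
phi and the standard expression for the largest phi attainable with
fixed proportions passing each item; the function names and the counts
are hypothetical.

```python
from math import sqrt

def phi(a, b, c, d):
    """Phi for a 2x2 table of counts.

    a: pass both items, b: pass item 1 only,
    c: pass item 2 only, d: fail both items.
    """
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def phi_max(p1, p2):
    """Largest phi attainable given the proportions passing each item."""
    lo, hi = min(p1, p2), max(p1, p2)
    return sqrt(lo * (1 - hi) / (hi * (1 - lo)))

# Hypothetical case: 100 examinees, an easy item (p = .90) and a hard
# item (p = .20), with everyone who passes the hard item also passing
# the easy one, i.e. the strongest association the marginals allow.
observed = phi(20, 70, 0, 10)
ceiling = phi_max(0.90, 0.20)
print(round(observed, 4), round(ceiling, 4), round(observed / ceiling, 4))
```

Even with the strongest possible association, phi is far below 1.00 when
the item difficulties differ this much, while phi/phi max reaches 1.00.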
Research  Related  to  Factor  Analysis  in  Test  Development 

The early use of factor analysis to construct and refine tests
was suggested by the work of McNemar (1942) in revising the Stanford-
Binet scales, and Burt and John (1943) in analyzing the Terman-Binet
scales.

Several contemporary psychometricians have advocated the use of
factor analysis in developing unidimensional tests (Cattell, 1957;
Hambleton and Traub, 1975; Henrysson, 1962; Lord and Novick, 1968). A
unidimensional test was defined briefly in the introduction to this
chapter, but a more precise definition is warranted. Lumsden (1961)
noted that a unidimensional test can be determined by the examinee
response patterns. If the test items are arranged from easiest to
hardest, person 1, who misses item 1, will miss all the other items, and
person 2, who gets item 1 correct but misses item 2, will miss all the
subsequent items, and so on. The above statement assumes infallible
items. However, most tests constructed today contain fallible items;
thus the response pattern will be disturbed by random error. Lumsden
suggested that in developing unidimensional tests factorially the items
be carefully selected on empirical grounds, thus reducing the problem of
too many heterogeneous items and the possibility of obtaining multiple
factors. By preselecting items one increases the chances of the items
converging on one factor.
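
The response pattern described above for infallible items can be
expressed as a simple check: with items ordered from easiest to hardest,
no correct answer may follow a miss. The sketch below is only an
illustration of that rule; the function name and the example patterns
are hypothetical.

```python
def is_perfect_scale_pattern(responses):
    """True if a 0/1 response vector (items ordered easiest to hardest)
    contains no correct answer after the first miss."""
    seen_miss = False
    for r in responses:
        if r == 0:
            seen_miss = True
        elif seen_miss:
            return False
    return True

person_1 = [0, 0, 0, 0]   # misses item 1 and every later item
person_2 = [1, 0, 0, 0]   # passes item 1, misses item 2 and the rest
disturbed = [1, 0, 1, 0]  # a pass after a miss, as random error would produce
print(is_perfect_scale_pattern(person_1),
      is_perfect_scale_pattern(person_2),
      is_perfect_scale_pattern(disturbed))
```

With fallible items, some share of examinees will show patterns like the
third one, which is why the text treats the perfect pattern as an ideal.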

The importance of developing unidimensional tests is demonstrated
most clearly in considering the concepts of test reliability
and  validity.   For  a  test  to  be  valid  it  must  actually  measure  the  trait 
it  was  intended  to  measure.   For  a  test  to  be  reliable  it  must  provide 
similar  results  upon  repeated  measurement.   It  should  be  easier  to 
estimate  these  two  important  aspects  of  a  test  when  the  test  is 
unidimensional  than  when  the  test  is  multidimensional,  hence  the  use  of 
a  unidimensional  test  in  the  present  study. 

Cattell  (1957)  has  suggested  that  in  the  development  of  a  factor 
homogeneous  scale,  one  should  preselect  items,  carry  out  a  preliminary 
factor  analysis,  then  select  for  further  analysis  those  items  which 
load  on  the  first  factor.   Cattell  defined  an  index  of  unidimensionality 
as  the  ratio  of  the  variance  of  the  first  factor  to  the  total  test 
variance.   This  index  has  no  set  criterion  and  the  sampling  distribution 
is  unknown. 
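
One way to compute such a ratio is sketched below. It assumes a
principal components reading of the index, in which the variance of the
first factor is the largest eigenvalue of the item correlation matrix
and the total variance is the trace of that matrix (the number of items
when unities are in the diagonal). The eigenvalue is found by plain
power iteration, which is adequate for the positively correlated items
assumed here; the function names and the correlation matrix are
hypothetical.

```python
def largest_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue of a symmetric matrix by power iteration."""
    n = len(matrix)
    v = [1.0 / n ** 0.5] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    av = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * av[i] for i in range(n))   # Rayleigh quotient

def first_factor_ratio(corr):
    """First-component variance divided by total variance (the trace)."""
    total_variance = sum(corr[i][i] for i in range(len(corr)))
    return largest_eigenvalue(corr) / total_variance

# Hypothetical correlation matrix: four items, all intercorrelations .50.
corr = [[1.0 if i == j else 0.5 for j in range(4)] for i in range(4)]
ratio = first_factor_ratio(corr)
print(round(ratio, 3))
```

For this matrix the first component accounts for 62.5 percent of the
total variance; as the text notes, there is no agreed criterion for how
large the ratio must be.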
Comparison  of  Factor  Analysis  to  Classical  Item  Analysis 

One measure of item validity, the biserial correlation, was described
for classical item analysis procedures. This same index is also obtained
by factor analysis. When the test items are factor analyzed, the factor
loading $a_{ij}$ is the item-factor association that is considered a
measure of item validity, i.e., the higher the factor loading, the
greater the relationship between the item and the factor it measures.
The factor loadings can be viewed as similar to the biserial correlations
discussed under classical test theory. This relationship between
factor loadings and biserial correlations has been discussed by several
authorities (Guertin and Bailey, 1970; Henrysson, 1962; Richardson,
1936).

Factor analysis as an item analytic technique was not realistically
possible for most psychometricians until the advent of high speed
computers. Guertin and Bailey (1970) have predicted that with the
increasing use of computers factor analysis will replace classical item
analysis as a test development technique. Because it is possible for
a test to reach the highest degree of homogeneity and yet be factorially
a very odd mixture of factors (Cattell and Tsujioka, 1964), classical
item analysis alone is not sufficient to determine if a test is
unidimensional. However, factor analysis not only provides a measure
of item-test correlation (the factor loading), it also provides an
indication of how many items form a unifactor test. Thus, factor
analysis has been advocated as a superior technique to classical item
analysis (Guertin and Bailey, 1970). Using factor analysis in test
development, psychometricians have advanced beyond an independent
analysis of item intercorrelations to a simultaneous analysis of item
intercorrelations with other individual items to obtain a measure of
test unidimensionality and item-factor association.

However, there is an inherent flaw in factor analysis, as there
was in classical item analysis, in test development. The flaw is that
both procedures are sample dependent. When an item analysis procedure,
or any procedure in general, is sample dependent, it means that the
results will vary from group to group. When the groups are very
dissimilar, there is much variability. Gulliksen (1950) noted that a
significant advance in item analysis theory would be made when a
method of obtaining invariant item parameters could be discovered. To
that end latent trait theory is an attempt to identify invariant
item parameters.

Item Analysis Procedures for the Latent Trait Model

Latent trait theory specifies a relationship between the observable
examinee test performance and the unobservable traits or abilities
assumed to underlie performance on a test (Hambleton et al., 1977). The
relationship is described by a mathematical function; hence latent
trait models are mathematical models. As noted earlier, there are four
major latent trait models for use with dichotomously scored data: the
normal ogive, and the one-, two-, and three-parameter logistic models
(Hambleton and Cook, 1977; Lord and Novick, 1968). All four models are
based on the assumption that the items in the test are measuring one
common ability and that the assumption of local independence exists
between the items and examinees. These two assumptions imply that a
test which measures only one trait or ability will have less measurement
error in the test score than a test that is multidimensional, and that
the response of an examinee to one item is not related to his response
on any other item. Where the latent trait models begin to differ is
with respect to the shape of their item characteristic curves.

The normal ogive, developed by Lord (1952a, 1955a), produces
an item characteristic curve based on the following formula:

$$ P_g(\theta) = \int_{-\infty}^{a_g(\theta - b_g)} \phi(t)\,dt \qquad (8) $$

where $P_g(\theta)$ is the probability that an examinee with ability
$\theta$ correctly answers item g, $\phi(t)$ is the normal density
function, $b_g$ represents item difficulty and $a_g$ represents item
discrimination.


The item characteristic curve of the two-parameter logistic model
developed by Birnbaum (1968) has the same shape as the normal ogive,
and Baker (1961) has shown them to be equivalent mathematical procedures.
The shape of the item characteristic curve of the two-parameter
logistic function is developed from the following formula:

$$ P_g(\theta) = \frac{e^{D a_g(\theta - b_g)}}{1 + e^{D a_g(\theta - b_g)}} \qquad (9) $$

$P_g(\theta)$, $a_g$, and $b_g$ have the same interpretation as in the
normal ogive. D is a scaling factor equal to 1.7 (the adjustment between
the logistic function and the normal density function), and e is the
base of the natural logarithm. In Figure 1a the shape of the normal ogive
and the two-parameter logistic curve has been illustrated. In the
figure, item A is more discriminating than item B, as noted by the
steepness of the slopes.

The three-parameter logistic model, also developed by Birnbaum (1968),
includes as an additional parameter an index for guessing. The
mathematical form of the three-parameter logistic curve is denoted

$$ P_g(\theta) = c_g + (1 - c_g)\frac{e^{D a_g(\theta - b_g)}}{1 + e^{D a_g(\theta - b_g)}} \qquad (10) $$

The parameter $c_g$, the lower asymptote of the item characteristic
curve, represents the probability of low ability examinees correctly
answering an item (Hambleton et al., 1977). In Figure 1b the shape of the
three-parameter logistic curve has been illustrated. In the figure,
item A is more discriminating and has less guessing involved than item B.
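
The item characteristic curves described above can be sketched in a few
lines. The example assumes the usual forms of the models: the normal
ogive as the cumulative normal evaluated at a(theta - b), and the
logistic curves with the scaling factor D = 1.7, where setting the
guessing parameter c to zero gives the two-parameter curve. Function
names and parameter values are hypothetical.

```python
from math import exp
from statistics import NormalDist

D = 1.7  # scaling factor between the logistic and normal curves

def normal_ogive(theta, a, b):
    """Probability of a correct answer under the normal ogive model."""
    return NormalDist().cdf(a * (theta - b))

def logistic_icc(theta, a, b, c=0.0):
    """Three-parameter logistic curve; c = 0 gives the two-parameter curve.

    Uses 1 / (1 + e^-z), which equals e^z / (1 + e^z).
    """
    z = D * a * (theta - b)
    return c + (1.0 - c) / (1.0 + exp(-z))

# Largest gap between the ogive and the two-parameter logistic curve
# over a grid of abilities from -4 to 4 for one hypothetical item.
grid = [x / 10 for x in range(-40, 41)]
max_gap = max(abs(normal_ogive(t, 1.0, 0.0) - logistic_icc(t, 1.0, 0.0))
              for t in grid)
print(round(max_gap, 4))

# With guessing, the curve passes through c + (1 - c) / 2 at theta = b.
print(logistic_icc(0.0, 1.0, 0.0, c=0.2))
```

On this grid the two curves never differ by as much as .01 in
probability, which is the practical sense in which they share a shape.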

The one-parameter logistic model, developed by Rasch (1960),
is commonly referred to as the Rasch model. The Rasch model, though
similar to the other latent trait models, was developed independently
from the other models. The Rasch model is based upon two propositions:
(a) the smarter an examinee, the more likely he is to answer the item
correctly, and (b) an examinee is more likely to answer an easy item
correctly than a difficult item. Mathematically the above propositions
can be stated in terms of odds or probability of success on an item.
The odds of an examinee with ability $\theta$ correctly answering an item
with difficulty $\zeta$ is given by the ratio of $\theta$ to $\zeta$
(Rasch, 1960):

$$ \text{odds} = \frac{\theta}{\zeta} \qquad (11) $$

The derivation of equation 11 was presented in Appendix A. Equation 11,
written more formally in the following equation, is the Rasch model:

$$ P(X_{ki} = 1 \mid \beta_k, \delta_i) = \frac{e^{\beta_k - \delta_i}}{1 + e^{\beta_k - \delta_i}} \qquad (12) $$

In equation 12, the probability of examinee k making a correct response
to item i, noted $X_{ki} = 1$, given an examinee of ability $\beta_k$
(where $\beta_k$ is the log transformation of $\theta$) taking an item of
difficulty $\delta_i$ (where $\delta_i$ is the log transformation of
$\zeta$), is a function of the difference between the
examinee's ability and the item's difficulty. The derivation of
equation 11 to equation 12 is presented in Appendix A.
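
The link between the odds form and the logistic form of the Rasch model
can be verified numerically. The sketch assumes that the probability of
success is odds / (1 + odds) and that ability and difficulty on the log
scale are the natural logarithms of their ratio-scale counterparts; the
function names and the numerical values are hypothetical.

```python
from math import exp, log

def rasch_odds(theta, zeta):
    """Odds of success: ratio-scale ability over ratio-scale difficulty."""
    return theta / zeta

def rasch_probability(beta, delta):
    """Probability of success from log-scale ability and difficulty."""
    return exp(beta - delta) / (1.0 + exp(beta - delta))

theta, zeta = 3.0, 1.5                 # hypothetical ability and difficulty
odds = rasch_odds(theta, zeta)         # 2 to 1 in favor of success
p_from_odds = odds / (1.0 + odds)
p_from_logits = rasch_probability(log(theta), log(zeta))
print(round(p_from_odds, 4), round(p_from_logits, 4))

# When ability equals difficulty the probability of success is one half.
print(rasch_probability(0.7, 0.7))
```

Both routes give the same probability, and only the difference between
the log-scale ability and difficulty enters the result.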

The  assumptions  for  the  Rasch  model  were  discussed  in  Chapter  I. 
Essentially the three assumptions are as follows:

1. There is only one trait underlying test performance.

2. Item responses of each examinee are statistically independent.

3. Item discriminations are equal.

The first two assumptions can be checked by conducting a factor analysis
of the test items as suggested by Lord and Novick (1968), and Hambleton
and Traub (1973). The assumptions are met if one dominant factor
emerges from the analysis. The third assumption can be checked by
plotting item characteristic curves for each item. In Figure 1c the
item characteristic curves for two hypothetical items based on the Rasch
model have been illustrated. The difficulty for items A and B is .5 and
1.5 respectively (point where p = .50), and the discriminations of the
two items are equal. The assumption that all items have equal
discriminations is quite restrictive; however, Rentz (1976) demonstrated,
in a simulation study, that the item slopes can deviate from 1 (where
all slopes are equal) by ± .25 and still fit the model. In a similar
simulation study, Dinero and Haertel (1976) concluded that the lack of
an item discrimination parameter in the Rasch model does not result
in poor item calibrations when discriminations are varied as much as .25.

The estimates for the Rasch parameters $\beta_k$ and $\delta_i$, examinee
ability estimate and item difficulty estimate respectively, are
sufficient, consistent, efficient, and unbiased (Anderson, 1973; Bock and
Wood, 1971). That is, the examinee's test score will contain all the
information necessary to measure the person ability parameter $\beta_k$,
and the sum of the right answers to a given item will contain all the
information used to calibrate the item parameter $\delta_i$ (Wright,
1977). Of the latent trait models, the Rasch model is unique in this
respect.


The mathematical rationale of the Rasch model is based upon the
separation of the ability and item difficulty parameters. As shown
in Appendix A, the estimation of the item parameters is independent of
the distribution of ability, and the estimation of ability is independent
of the distribution of item difficulty (Rasch, 1966). Several studies have
demonstrated this (Anderson et al., 1968; Tinsley and Dawis, 1975; Whitely
and Dawis, 1974; Whitely and Dawis, 1976; Wright, 1968; Wright and
Panchapakesan, 1969). The separation of the ability and item parameters
leads to what Rasch has termed specific objectivity. Specific
objectivity means that the measurement of a person's
ability depends neither upon the sample of items used nor upon the
examinee group in which the person is tested. Once a set of items has been
calibrated to the Rasch model, any subset of the calibrated items will
produce the same estimate of the examinee's ability. This type of
objectivity is possessed by the physical sciences and is the goal toward
which mental measurement should be aimed in the future. Toward the
goal of objective measurement, several researchers have conducted
empirical studies comparing classical factor analytic test development
procedures to the latent trait models, and comparisons have also been
made among the various latent trait models.
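
The sufficiency of the raw score can be checked numerically. The sketch below is illustrative only and is not taken from the study: it assumes the usual logistic form of the Rasch model, and the item difficulties, ability values, and function names are hypothetical. It shows that the probability of a response pattern, given the raw score, is the same at two different ability levels.

```python
import itertools
import math

def rasch_p(beta, delta):
    """P(correct) under the Rasch model for ability beta and difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))

def pattern_prob(pattern, beta, deltas):
    """Probability of a 0/1 response pattern, assuming local independence."""
    prob = 1.0
    for x, delta in zip(pattern, deltas):
        p = rasch_p(beta, delta)
        prob *= p if x == 1 else 1.0 - p
    return prob

def prob_given_score(pattern, beta, deltas):
    """P(pattern | raw score): the pattern's probability divided by the total
    probability of all patterns that have the same number right."""
    score = sum(pattern)
    same_score = (q for q in itertools.product((0, 1), repeat=len(deltas))
                  if sum(q) == score)
    return pattern_prob(pattern, beta, deltas) / sum(
        pattern_prob(q, beta, deltas) for q in same_score)

# Hypothetical values chosen only for illustration.
deltas = [-1.0, 0.0, 0.5, 1.5]
pattern = (1, 1, 0, 0)
low = prob_given_score(pattern, beta=-1.0, deltas=deltas)
high = prob_given_score(pattern, beta=2.0, deltas=deltas)
print(round(low, 6), round(high, 6))  # the two conditional probabilities agree
```

Because the conditional probability does not involve the ability parameter, the item difficulties can in principle be estimated without reference to the ability distribution of the examinee group.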

Research Related to Latent Trait Models in Test Development

Baker (1961) conducted one of the earlier comparative studies
of two latent trait models. He compared the effect of fitting the
normal ogive and the two-parameter logistic model to the same set of
data from a scholastic aptitude test. The two-parameter model, like
the normal ogive, provides item difficulty and item discrimination estimates.
The empirical results suggested little difference between the
two procedures as measured by a chi-square test of fit. However, Baker
noted that the computer running time of the logistic model was one-third
that of the ogive model; thus he concluded that the logistic model was
more efficient in terms of cost than the ogive.

Hambleton and Traub (1971) compared the efficiency of ability
estimates provided by the Rasch model and the two-parameter model to
the three-parameter logistic model using Birnbaum's (1968) concept of
information. The three-parameter model provides item difficulty
and discrimination estimates and also accounts for guessing on
each item. Eleven simulated tests of fifteen items each were generated,
varying item discrimination and degree of guessing. The authors
sought to determine how efficient the one- and two-parameter logistic
models were under these conditions, taking the three-parameter model to
be the true model. The results indicated that when guessing was a
factor the three-parameter model was most efficient in providing ability
estimates, but when guessing was not a factor all models were equally
efficient. Since the Rasch model has fewer parameters to estimate, and
hence takes less computer time to run than the other two models, it would
be preferred in the absence of guessing. In considering item
discrimination, when the guessing parameter was set to zero, the Rasch
model was as efficient as the two-parameter model when item discrimination
varied from .59 to .79. As item discrimination deviated from this range,
the two-parameter model was more efficient.

Hambleton and Traub (1975) compared the one- and two-parameter
models with three sets of real data: the verbal and mathematics subtests
of a scholastic aptitude test used in Ontario (items = 45 and 20,
respectively) and the verbal section of the Scholastic Aptitude Test
(SAT, items = 80). Their results indicated that generally the
two-parameter model fit the data better than the one-parameter model. The
loss in predicting performance with the one-parameter model was greatest
on the shorter mathematics test and smallest on the longer SAT. These
findings confirm Birnbaum's conjecture (1968, p. 492) that if the number
of items in a test is very large the inferences that can be made about
an examinee's ability will be much the same whether the Rasch model or
the two-parameter logistic model is used. The authors questioned whether
the gain obtained with the two-parameter model is worth the increased
computer cost of estimating the item discrimination parameter. Based on
the results of these studies, it is concluded that the Rasch model is the
most efficient of the latent trait models, and hence it will be used in
comparison to the more traditional methods of test development included
in the present study.

Comparison of the Rasch Model to Factor Analysis

Two recent studies have compared the Rasch model to
factor analysis. Anderson (1976) posed two questions concerning the
Rasch model and factor analysis: (a) what types of items would be
excluded in terms of difficulty and discrimination using Rasch and
factor analysis as item analytic techniques, and (b) what effect would
the two procedures have on validity? Anderson used 235 middle
school students' responses to a 15 item Likert-type scale that was
dichotomized for use with the Rasch model and the factor analytic
procedures. A principal component factor analysis based upon tetrachoric
correlation coefficients was compared to the Rasch model using the
CALFIT computer program (Wright and Mead, 1975). Only items fitting
the model were used. His results indicated that the Rasch procedure
eliminated the more difficult items and the factor analytic procedure
eliminated the easier items, a statistically significant difference as
determined by a chi-square test at p < .01. For item discrimination, the
Rasch procedure eliminated items with very low and very high discriminations,
while the factor analytic procedure tended to reject only items with very
low discriminations. The difference here was not statistically significant.
On the second question, that of test validity, the two procedures showed
very similar results when test score was correlated with course grade
point average.

In a similar study Mandeville and Smarr (1976) developed a two
stage design. First they compared the Rasch procedure to factor analysis;
then they combined the two analytic procedures. The authors felt the
combined approach would be a more effective item analytic approach
than any single method in determining which items fit the Rasch model.
Two cognitive data sets (one standardized and one classroom) and one
simulated set were used in the study. A rotated principal axis factor
analysis based upon phi correlation coefficients was compared to the
Rasch model using the CALFIT program.

The results indicated that for the standardized and simulated data
sets the double procedure of factor analyzing the items, then submitting
only the items loading on the first factor to the Rasch procedure, was
not useful. The Rasch procedure alone was just as effective as
the double procedure in selecting items that fit the model.

For the classroom data set the investigators found that 92 percent
of the items fit the Rasch model, but upon factor analyzing these
items only seven percent of the total test variance was associated with
the first factor. Their results tend to indicate that factor analysis
and the Rasch procedure do not always identify the same unidimensional
trait underlying test performance. However, the results of the
Mandeville and Smarr study may be suspect for three reasons. First,
the phi coefficient, which can be seriously affected when p and q
take on extreme values, was used to form the intercorrelation
matrix that was factor analyzed. The greater the difference between p and q,
the smaller the maximum attainable correlation; hence very easy and very
difficult items will have systematically lower coefficients and will
tend to bias the results of the analysis in favor of moderately difficult
items. Second, the factor analysis was based on a principal axis
solution, using some value less than 1.00 in the diagonal; hence less
variance was used in the total solution for comparison with the
Rasch procedure, which utilizes all the test variance available.
Third, the principal axis solution was rotated, so that the
variance associated with the first factor was distributed
among the other factors and the first factor was no longer as strong as it had been.
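
The first objection can be made concrete with a short calculation. The sketch below is an illustration rather than part of the reviewed study; the function names and the item difficulties are hypothetical. It computes the largest phi coefficient attainable between two items with fixed difficulties, which occurs when every examinee who passes the harder item also passes the easier one.

```python
import math

def phi(p1, p2, p11):
    """Phi coefficient from two item difficulties (p1, p2) and the joint
    proportion p11 of examinees answering both items correctly."""
    return (p11 - p1 * p2) / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))

def phi_max(p1, p2):
    """Largest phi attainable for fixed difficulties: everyone who passes the
    harder item also passes the easier one, so p11 = min(p1, p2)."""
    return phi(p1, p2, min(p1, p2))

# Hypothetical pairs of item difficulties, from equal to very unequal.
for p1, p2 in [(0.5, 0.5), (0.3, 0.7), (0.1, 0.9)]:
    print(p1, p2, round(phi_max(p1, p2), 3))
```

The ceiling equals 1 only when the two difficulties are equal and shrinks as they diverge, which is why a factor analysis of phi coefficients tends to penalize very easy and very difficult items.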

Summary

In the development of tests based upon classical item analysis,
two main statistics are used in reviewing and revising test items:
item difficulty and item discrimination. The item discrimination index
provides information as to the validity of the item in relation to total
test score, while item difficulty indicates how appropriate the item was
for the group tested. A serious limitation of classical item analysis
is that the statistics obtained for examinees and items are sample
dependent (Hambleton and Cook, 1977; Wright, 1968).

The same problem of sample dependency also exists for factor
analysis. However, factor analysis is viewed as a superior technique
to classical item analysis for two reasons: (a) factor analysis
considers the intercorrelations of each item with all other items
simultaneously, and (b) factor analysis provides an indication of how
many factors or abilities the test is measuring. Also, in factor analysis
the factor loading is comparable to the item discrimination index of
classical item analysis, thus providing a measure of item validity
for each item on each factor in the test.

Not until the development of latent trait models was a solution
suggested to the problem of sample dependency of the statistics for
items and examinees. The Rasch model in particular has been shown to
provide item statistics that are independent of the group on which they
were obtained, as well as examinee statistics that are independent of
the group of items on which they were tested. This feature of the
Rasch model provides for more objective mental measurement.

The Rasch model has been compared to other latent trait models
and has been shown to be as efficient in many cases as the more complex
models. The Rasch model has also been compared with factor analytic
procedures in determining test unidimensionality, validity, and types
of items retained and excluded by the two procedures. Missing from
this review is a comparative study of the three item analytic techniques
using the same data base and a comparison of the efficiency of tests
developed from the three techniques across ability levels. Also missing
from the literature is the effect of varying sample size and number of
items, as well as the kinds of items each of the three procedures would
either retain or exclude in test development.

An empirical investigation into these areas therefore
seems warranted to determine which procedure under the various
conditions would produce the superior test in terms of internal
consistency and efficiency. It was for this reason that the present
study was undertaken, comparing the three methods of classical item
analysis, factor analysis, and the Rasch model used in test development.
The design of the study is described in Chapter III.


CHAPTER  III 
METHOD 

An empirical study was designed to compare the effects of three
methods of item analysis on test development for different sample sizes.
The three methods of item analysis studied were classical item analysis,
factor analysis, and Rasch analysis. The sample sizes used to compare
the three item analytic methods were 250, 500, and 995 subjects. The
study was designed in three phases: (a) item selection, (b) a double
cross-validation of the selected items, and (c) statistical analyses
of the selected items. For each item analytic procedure two tests
were developed, a 15 item test and a 30 item test. Four dependent
variables were obtained for each test: (a) an estimate of internal
consistency, (b) the standard error of measurement, (c) item difficulty,
and (d) item discrimination. A description of the subjects, instrument
used, research design, and statistical analyses is presented in this
chapter.

The  Sample 

In the fall of 1975, all high school seniors in the State of
Florida (N = 78,751) were tested as part of the State assessment program.
The population was drawn from 435 high schools throughout the state. From
this population a 1 in 15 systematic sample of 5,250 subjects was
chosen (Mendenhall, Ott, and Scheaffer, 1971). A systematic sample was
selected to ensure samples from every high school in the state. The
types of data obtained on each subject were sex, race, item responses,
and total score.
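
A 1 in k systematic sample takes every k-th unit from an ordered file after a starting position within the first k units. The sketch below is a minimal illustration and not the program used in the study; the function name, the stand-in population list, and the starting position are assumptions.

```python
def systematic_sample(units, k, start):
    """Take every k-th unit from an ordered list, beginning at index `start`."""
    if not 0 <= start < k:
        raise ValueError("start must satisfy 0 <= start < k")
    return units[start::k]

population = list(range(78751))   # stand-in for the ordered population file
sample = systematic_sample(population, k=15, start=7)   # start=7 is arbitrary
print(len(sample))
```

With 78,751 units and k = 15, a starting position other than the very first unit yields the 5,250 subjects reported above.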

The data file was edited to remove those subjects who
answered either all or none of the items correctly. The rationale for
this procedure was that the Rasch model cannot calibrate items when
a person has a perfect score or, alternatively, when a person has no
items correct (Wright, 1977). Through the editing procedure 15 subjects
were removed; thus the available sample size was 5,235. Because such
a small number of subjects was removed, it seems unlikely that the
elimination of these subjects would bias the results in favor of any
of the three item analytic techniques.

The  Instrument 

The instrument selected for use in this study was the Verbal
Aptitude subtest of the Florida Twelfth Grade Test, developed by the
Educational Testing Service. The Florida Twelfth Grade Test is a
statewide assessment battery
which has been administered every year since 1935 (Benson, 1975). The
Verbal Aptitude subtest comprises 50 verbal analogies in a
multiple choice format, from which a single score based on the number
of items correct is reported. Descriptive information on the Verbal
Aptitude subtest for the population tested in 1975 is presented in
Table 1.

This particular instrument was selected for three reasons. First,
it is a cognitive measure of verbal ability, and much of classical
test theory has been built upon tests in the cognitive domain. Second,
it is similar to and hence representative of other national aptitude
tests used for college admissions. Third, it has a large data pool
from which to sample.

Classical test theory has been built mainly around the development
of cognitive tests. Therefore, it seemed desirable to compare the
new procedures of latent trait theory, via the Rasch model, to the
procedures of classical test theory (factor analysis and classical
item analysis) by using a cognitive test. Thus, the results may be
more generalizable to the major type of tests developed by practitioners
in the field.

TABLE 1

DESCRIPTIVE DATA ON THE VERBAL APTITUDE SUBTEST
OF THE FLORIDA TWELFTH GRADE TEST
1975 ADMINISTRATION

Number of Schools = 435          Number of Students = 78,751

Number of items                  50
Mean                             25.95
Standard Deviation               8.23
Reliability^a                    .88
Standard Error of Measurement    2.85

Note: Data obtained from the Florida Twelfth Grade Testing
Program, Report No. 1-75, Fall 1975.

^a Reliability based on the split-half method, and corrected by
the Spearman-Brown formula.
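
The standard error of measurement reported in Table 1 is consistent with the classical formula SEM = SD x sqrt(1 - reliability). The check below is a minimal sketch that assumes this formula was the one applied; the function name is hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical standard error of measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Standard deviation and reliability as reported in Table 1.
sem = standard_error_of_measurement(sd=8.23, reliability=0.88)
print(round(sem, 2))
```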

The  Procedure 
Design 

The sample of 5,235 was divided into nine systematic samples in
the following manner:

Group 1 = three independent samples of 250 students each;

Group 2 = three independent samples of 500 students each;

Group 3 = three independent samples of 995 students each.

From the initial editing of the data file, previously described, 15
subjects were removed from the total sample of 5,250. It was decided
that this loss of subjects would affect only Group 3,
since it was the largest. Thus, the number of subjects in each of
the three independent samples of Group 3 was reduced by five, resulting
in three independent samples of 995 subjects each.

The purpose of obtaining the three separate samples for the three
groups was to ensure that each item analytic and double cross-validation
procedure used an independent sample, so that tests of statistical
significance could be performed. The scheme shown in Table 2 was used
to obtain the nine samples. In the present study the independent
variables were sample size and item analytic procedure.

TABLE 2

SYSTEMATIC SAMPLING DESIGN OF THE STUDY^a
N = 5,235

            Sampling    Sample   Number     Item Analytic     Total Sample
Group       Procedure   Number   Selected   Procedure         Remaining

Group 1     1 in 20     1        250        Classical         4,985
            1 in 19     2        250        Factor Analysis   4,735
            1 in 18     3        250        Rasch             4,485

Group 2     1 in 8      4        500        Classical         3,985
            1 in 7      5        500        Factor Analysis   3,485
            1 in 6      6        500        Rasch             2,985

Group 3^b   1 in 3      7        995        Classical         1,995
            1 in 2      8        995        Factor Analysis   995
            remaining   9        995        Rasch             0

^a The sampling procedure was randomly assigned to item analytic
technique in Group 1 and the same pattern carried out for Group 2
and Group 3.

^b Those subjects edited from the data file were removed equally from
Group 3, hence the reduced sample size.

The item data were analyzed in three phases: (a) selection of the
items, (b) computation of item and test statistics for selected items
on double cross-validation samples, and (c) statistical analyses of
item characteristics to test the hypotheses.

Item Selection

The three independent samples within each of the groups of subjects
(N = 250, N = 500, N = 995) were each submitted to one of the three item
analytic procedures (in accordance with Table 2) in order to select a
specified number of items, namely the "best" 15 and 30 items. These
two sets of items comprised two separate tests; however, all of
the items on each 15 item test were always included on the corresponding
30 item test. A different process for selecting the items was used with
each item analytic technique, as described in the following three
sections.

Classical item analysis. The definition of the "best" items was
based on the numerical magnitude of the items' biserial correlations.
The biserial correlation was defined as the correlation between the
artificially dichotomized item score (1 or 0) and total test score.
In using the biserial correlation the assumption was made that the
variable underlying the artificially dichotomized item score had a
continuous and normal distribution (Magnusson, 1966).

In order to obtain biserial correlations for the items under the
classical item analysis procedure, the 50 verbal items were submitted
to the item analysis program GITAP^5 for each of the three sample sizes.
The 15 and 30 items with the highest biserial correlations were selected
as the best items from the total subtest. Item difficulties were also
obtained for the "best" 15 and 30 items selected. Item difficulty has
been defined as the proportion of persons getting a particular item
correct out of the total number of persons attempting that item
(Mehrens and Lehman, 1973).

^5 The Generalized Item Analysis Program (GITAP) is a part of the
test analysis package developed by F. B. Baker and T. J. Martin,
Occasional Paper No. 10, Michigan State University, 1970.
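
The two classical item statistics can be computed directly from scored responses. The sketch below is not the GITAP program; it is a minimal illustration that assumes the textbook biserial formula (the point-biserial rescaled by the normal ordinate), and the function names and the small data set are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def item_difficulty(item):
    """Proportion of examinees answering the item correctly (scored 1/0)."""
    return mean(item)

def point_biserial(item, total):
    """Pearson correlation between a 1/0 item score and the total score."""
    p = item_difficulty(item)
    m1 = mean(t for x, t in zip(item, total) if x == 1)
    m0 = mean(t for x, t in zip(item, total) if x == 0)
    return (m1 - m0) / pstdev(total) * (p * (1 - p)) ** 0.5

def biserial(item, total):
    """Biserial correlation: the point-biserial rescaled by sqrt(pq) / y, where
    y is the normal ordinate at the point cutting off proportion p. Undefined
    when every examinee passes or every examinee fails the item."""
    p = item_difficulty(item)
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return point_biserial(item, total) * (p * (1 - p)) ** 0.5 / y

# Hypothetical responses of ten examinees to one item, with their total scores.
item = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
total = [8, 5, 6, 9, 6, 4, 3, 7, 5, 4]
print(item_difficulty(item), round(point_biserial(item, total), 3),
      round(biserial(item, total), 3))
```

The biserial is always larger in absolute value than the point-biserial computed on the same data, and with markedly non-normal totals it can exceed 1.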

Factor analysis. Item selection based on factor analysis was
accomplished using the computer programs developed for the Education
Evaluation Laboratory at the University of Florida. These programs
have been described by Guertin and Bailey (1970). The present study
was concerned only with the items that loaded on the first principal
component, in order to adhere to the unidimensionality assumption of
the test. The principal components analysis was based on a matrix of
tetrachoric item intercorrelations with unities in the diagonal.

The tetrachoric correlation was chosen to produce the intercorrelation
matrix for the same reason the biserial correlation was chosen:
knowledge of an item was assumed to be normally and continuously
distributed. In the case of the tetrachoric correlation each item (scored
1 or 0) was correlated with every other item.

The 15 and 30 items with the highest loadings on the first unrotated
principal component were selected from the total subtest. These
component loadings are analogous to the biserial correlations previously
described, where the loading refers to the relationship of the item to
the principal component or factor (Guertin and Bailey, 1970; Henrysson,
1962).
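
Selection on the first unrotated principal component can be sketched as follows. This is not the Guertin and Bailey program: the sketch assumes the item intercorrelation matrix has already been computed (the tetrachoric step is omitted), uses simple power iteration in place of a full eigen-decomposition, and the function names and the small correlation matrix are hypothetical.

```python
import math

def first_component_loadings(corr, iterations=500):
    """Loadings on the first unrotated principal component of a correlation
    matrix with unities in the diagonal, found by power iteration
    (unit eigenvector times the square root of the largest eigenvalue)."""
    n = len(corr)
    v = [1.0 / math.sqrt(n)] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = math.sqrt(sum(x * x for x in w))
        v = [x / eigenvalue for x in w]
    return [x * math.sqrt(eigenvalue) for x in v]

def select_best(loadings, k):
    """Indices of the k items with the highest first-component loadings."""
    order = sorted(range(len(loadings)), key=lambda i: loadings[i], reverse=True)
    return order[:k]

# Hypothetical intercorrelations among four items; item 3 is weakly related.
corr = [[1.0, 0.6, 0.5, 0.1],
        [0.6, 1.0, 0.4, 0.2],
        [0.5, 0.4, 1.0, 0.1],
        [0.1, 0.2, 0.1, 1.0]]
loadings = first_component_loadings(corr)
print([round(x, 3) for x in loadings], select_best(loadings, 3))
```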

Rasch analysis. The selection of items based on the Rasch model
was accomplished in two stages. First, in order to check the assumption
of a unidimensional test, a factor analysis using a principal components
solution was used. Items were selected with loadings between .59 and
.79 on the first unrotated factor, to hold the discrimination index
of the items approximately constant. Hambleton and Traub (1971) have
shown that the efficiency of a test developed using the Rasch model will
remain very high (over 95 percent) when the range of the discrimination
index is held between .59 and .79. Second, the items selected from the
principal components solution using the above criterion were submitted
to a Rasch analysis using the BICAL program (Wright and Mead, 1976). Items
were selected based upon the mean square fit of the items to the
Rasch model. The best 15 and 30 items fitting the model were chosen
from the total subtest, and their corresponding item difficulties
reported.
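
The idea of a mean square fit statistic can be illustrated with a simple sketch. The exact statistic computed by the Wright and Mead program is not reproduced here; the code assumes one common form, the unweighted mean of squared standardized residuals for an item, and the function names and values are hypothetical.

```python
import math

def rasch_p(beta, delta):
    """P(correct) under the Rasch model for ability beta and difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))

def item_mean_square(responses, betas, delta):
    """Unweighted mean square of standardized residuals for one item.
    responses[v] is the 0/1 answer of person v and betas[v] is that person's
    ability estimate; delta is the item's difficulty estimate."""
    total = 0.0
    for x, beta in zip(responses, betas):
        p = rasch_p(beta, delta)
        total += (x - p) ** 2 / (p * (1.0 - p))
    return total / len(responses)

# Hypothetical case: able persons fail and less able persons pass the item.
print(round(item_mean_square([0, 1], betas=[2.0, -2.0], delta=0.0), 2))
```

Under this form, values near 1 are expected when the data fit the model, and values well above 1 signal responses that the model finds improbable.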

Double  Cross-Validation 

A double cross-validation design (Mosier, 1951) was used to obtain
item parameter estimates for the best 15 and 30 items selected by the
three item analytic techniques for the three sample sizes. In this
study a 3 X 3 latin square was used to reassign samples. This procedure
ensured that the estimates of the item parameters would be based
upon a different sample of subjects than the original sample used to
identify the best items. Each item analytic technique was randomly
reassigned, using a latin square procedure (Cochran and Cox, 1957,
p. 121), to a different sample within each of the three groups (N = 250,
N = 500, N = 995). The double cross-validation design is shown in
Table 3.
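
The reassignment can be sketched with a cyclic 3 X 3 latin square. This is a simplified illustration and not the randomization procedure given by Cochran and Cox: it assumes the square is built from shifted rows and that one of the two rows that move every technique is drawn at random. The names and the seed are hypothetical.

```python
import random

TECHNIQUES = ["Classical", "Factor Analysis", "Rasch"]

def cyclic_latin_square(n):
    """n x n latin square whose row r is the sequence 0..n-1 shifted by r."""
    return [[(r + c) % n for c in range(n)] for r in range(n)]

def reassign(rng):
    """Draw one of the shifted rows (row 0 would leave every technique on its
    own sample). Keys are the techniques that selected items on a sample;
    values are the techniques whose items are cross-validated on it."""
    row = rng.choice(cyclic_latin_square(len(TECHNIQUES))[1:])
    return {TECHNIQUES[c]: TECHNIQUES[row[c]] for c in range(len(TECHNIQUES))}

rng = random.Random(1977)   # arbitrary seed, for reproducibility only
for group in ("Group 1", "Group 2", "Group 3"):
    print(group, reassign(rng))
```

The two patterns that appear in Table 3 correspond to the two shifted rows of such a square.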

TABLE 3

DOUBLE CROSS-VALIDATION DESIGN OF THE STUDY

          Sample     Number     Item Analytic     Double Cross-
Group     Number^a   Selected   Procedure         Validation Procedure^b

Group 1   1          250        Classical         Factor Analysis
          2          250        Factor Analysis   Rasch
          3          250        Rasch             Classical

Group 2   4          500        Classical         Rasch
          5          500        Factor Analysis   Classical
          6          500        Rasch             Factor Analysis

Group 3   7          995        Classical         Rasch
          8          995        Factor Analysis   Classical
          9          995        Rasch             Factor Analysis

^a The sample number is the same as referred to in Table 2.

^b Assignment to sample was based on a randomized 3 X 3 latin
square procedure.

The best 15 and 30 items selected by each item analytic procedure
in the first phase of the study were submitted to a standard item
analysis program (GITAP), from which were obtained the dependent
variables in the study:

• indices of internal consistency as measured by the analysis
of variance procedure (Hoyt, 1941)

• the standard error of measurement

• item difficulty

• biserial correlations

By submitting the best 15 and 30 items selected by each item analytic
procedure in the study to a common item analysis program, comparable
measures of the dependent variables were obtained.
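
Hoyt's analysis of variance estimate of internal consistency treats the data as a persons by items layout and compares the residual mean square with the mean square for persons. The sketch below is a minimal illustration, not the GITAP computation; the function names and the small score matrix are hypothetical. Coefficient alpha is included only as a cross-check, since the two are algebraically equal.

```python
def hoyt_reliability(scores):
    """Hoyt's analysis of variance reliability for a persons x items matrix of
    0/1 scores: 1 - MS(residual) / MS(persons)."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    person_means = [sum(row) / k for row in scores]
    item_means = [sum(row[i] for row in scores) / n for i in range(k)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_persons = k * sum((m - grand) ** 2 for m in person_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ms_persons = ss_persons / (n - 1)
    ms_residual = (ss_total - ss_persons - ss_items) / ((n - 1) * (k - 1))
    return 1.0 - ms_residual / ms_persons

def coefficient_alpha(scores):
    """Coefficient alpha (KR-20 for 0/1 items), used here as a cross-check."""
    k = len(scores[0])

    def var(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    item_vars = sum(var([row[i] for row in scores]) for i in range(k))
    total_var = var([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - item_vars / total_var)

# Hypothetical responses of six persons to four items.
scores = [[1, 1, 1, 0], [1, 0, 1, 0], [0, 0, 1, 0],
          [1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0]]
print(round(hoyt_reliability(scores), 4), round(coefficient_alpha(scores), 4))
```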

Statistical Analyses

The third phase of the study focused on obtaining measures of
statistical significance for three of the dependent variables: internal
consistency, item difficulties, and biserial correlations. Only visual
comparisons were made for the remaining dependent variable, the standard
error of measurement. The internal consistency estimates from each test
were compared to the projected population value for tests of similar
length via confidence intervals as suggested by Feldt (1965). (Projected
population values were obtained using the Spearman-Brown Prophecy Formula.)
Item difficulties for the 15 and 30 best items were submitted to a
two-way analysis of variance, the two factors being sample size and item
analytic technique. This procedure was used to test for differences in
the types of items selected, in terms of item difficulty, by each technique.
If statistical significance was observed, with α = .05, Tukey's HSD
(honestly significant difference) post hoc procedure (Kirk, 1968) was
employed to determine which item analytic technique(s) resulted in a
test with the highest item difficulties.
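
The Spearman-Brown projection mentioned above can be illustrated as follows. The sketch is an assumption-laden example: it takes the 50 item reliability of .88 from Table 1 as the starting value, which may not be the population value actually used in the study, and the function name is hypothetical.

```python
def spearman_brown(reliability, length_ratio):
    """Projected reliability when test length is multiplied by length_ratio."""
    return (length_ratio * reliability) / (1.0 + (length_ratio - 1.0) * reliability)

full_length, full_reliability = 50, 0.88   # values reported in Table 1
for n_items in (15, 30):
    projected = spearman_brown(full_reliability, n_items / full_length)
    print(n_items, round(projected, 3))
```

Shortening the test lowers the projected reliability, so each 15 and 30 item test is compared against a value appropriate to its own length rather than against the full-length figure.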

The biserial correlations were transformed to an interval scale
of measurement using a linear function of z suggested by Davis (1946).
The linear transformation was based upon converting the biserial
correlation to z values, and then eliminating the decimals and negative
values of z by multiplying each z value by the constant 60.241 (Davis,
1946, pp. 12-15). Thus, the transformed biserials ranged
between 0 and 100. A two-way analysis of variance (sample size by item
analytic technique) was performed on the transformed biserial correlations
for the best 15 items. This type of analysis was used to test for
differences in the types of items selected, in terms of biserial
correlations, by each technique. If statistical significance was observed,
with α = .05, Tukey's HSD post hoc procedure was employed to determine which
item analytic technique(s) resulted in higher transformed biserial
correlations.
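
A sketch of the transformation is given below. It assumes that the z values referred to are Fisher's z transformation of the correlation, which is an interpretation of the description above rather than a detail confirmed by it; the function name is hypothetical, and the constant is the one quoted in the text.

```python
import math

DAVIS_CONSTANT = 60.241   # multiplier quoted in the text from Davis (1946)

def davis_index(r_bis):
    """Fisher z of the correlation scaled by the Davis constant (assumed form).
    Rounding the result to a whole number removes the decimals."""
    return DAVIS_CONSTANT * math.atanh(r_bis)

for r in (0.0, 0.3, 0.5, 0.7):
    print(r, round(davis_index(r)))
```

Under this reading, equal differences on the transformed scale represent equal differences in z, which is the sense in which the index approximates an interval scale.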

The two-way analysis of variance and post hoc analysis, where
indicated, for the transformed biserial correlations were also performed
on the 30 best items.

The analysis of variance procedure is appropriate only if the
distributions of the item difficulties and (transformed) biserial
correlations approximate normality and the variances are homogeneous
(Ware and Benson, 1975).

In addition to tests of statistical significance, a measure of
the efficiency of the 30 best items selected by each procedure was
compared for the sample of 995 subjects. Birnbaum (1968) defined the
relative efficiency of two testing procedures as the ratio of their
information curves. Lord (1974a) has described a procedure to compare
the relative efficiency of one test with another at different ability
levels. If two tests to be compared vary in difficulty, then the
relative efficiency of each will usually be different at different
ability levels (Lord, 1974b; 1977). In classical test theory it is
common to compare two tests that measure the same ability in terms of
their reliability coefficients, but this gives only a single overall
comparison. The formula developed by Lord for relative efficiency
provides a more precise way of comparing two tests that measure the
same ability. The formula for approximating relative efficiency is
(Lord, 1974b, p. 248):

$$\mathrm{R.E.}(y, x) = \frac{n_y}{n_x} \cdot \frac{x\,(n_x - x)\,f_x^2}{y\,(n_y - y)\,f_y^2} \qquad (13)$$

where R.E. denotes the relative efficiency of y compared to x, $n_x$ and $n_y$
denote the number of items in the two tests, x and y are the number-right
scores having the same percentile rank, and $f_x^2$ and $f_y^2$ are the
squared observed frequencies of x and y. Lord has suggested that
formula 13 be used only with a large sample of examinees and tests that
are not extremely short; hence this comparison was restricted to the
case where N = 995 and the 30 item test.
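
The relative efficiency approximation can be sketched in a few lines. The code reads formula 13 as (n_y / n_x) times x(n_x - x) f_x^2 divided by y(n_y - y) f_y^2; this reading, the function name, and the illustrative scores and frequencies are assumptions and are not values from the study.

```python
def relative_efficiency(n_y, n_x, y, x, f_y, f_x):
    """Approximate relative efficiency of test y to test x at one ability level:
    (n_y / n_x) * x (n_x - x) f_x^2 / (y (n_y - y) f_y^2).
    x and y are number-right scores with the same percentile rank, and f_x and
    f_y are their observed frequencies."""
    return (n_y / n_x) * (x * (n_x - x) * f_x ** 2) / (y * (n_y - y) * f_y ** 2)

# Hypothetical pair of equipercentile scores on two 30 item tests.
print(relative_efficiency(n_y=30, n_x=30, y=20, x=18, f_y=50, f_x=60))
```

Repeating the calculation at each pair of scores with matching percentile ranks traces relative efficiency across the ability range; a value above 1 indicates that test y is the more efficient test at that level.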

Three relative efficiency comparisons using the 30 item tests were
made: (a) the test based on factor analysis was compared to the test
based on classical item analysis, (b) the test based on the Rasch
analysis was compared to the test based on classical item analysis, and
(c) the test based on the Rasch analysis was compared to the test based
on factor analysis.

Summary

An  empirical  study  was  designed  to  compare  the  effects  of  classical 
item analysis, factor analysis, and the Rasch model on test development.
Item response data were obtained from a sample of 5,235 high school
seniors on a cognitive test of verbal aptitude.

The subjects were divided into 9 samples: three independent
groups of 250 subjects each, three independent groups of 500 subjects
each, and three independent groups of 995 subjects each. The independent
groups were obtained so that tests of statistical significance could
be performed.

The  item  response  data  were  then  analyzed  in  three  phases.   First, 
the  "best"  15  and  30  items  were  selected  using  each  item  analytic 
technique.   Under  classical  item  analysis,  the  best  15  and  30  items 
were selected based on the highest biserial correlations. For factor
analysis, the best 15 and 30 items were selected based on the highest
item loadings on the first (unrotated) principal component. The
selections of the best 15 and 30 items using the Rasch model were
based upon the mean square fit of the items to the model. These
procedures were used for each group of subjects. Second, a double
cross-validation design was employed to obtain estimates of the item
parameters for the best 15 and 30 items. The three item analytic
techniques were reassigned randomly to different samples of subjects
within each level of sample size. Then, the best 15 and 30 items
chosen by each method were submitted to a common item analytic procedure
in order to obtain estimates for comparing the three item analytic
methods. Third, a two-way analysis of variance and a Tukey post hoc
comparison test, when indicated, were used to test for differences in
the properties of items selected by each item analytic procedure. Also,
confidence intervals were calculated to compare the internal consistency
estimates to a population value. In addition, the relative efficiencies
of the 30 item tests developed by each item analytic technique were
compared for the sample of 995 subjects.




CHAPTER  IV 
RESULTS 

The  study  was  designed  to  compare  empirically  the  precision  and 
efficiency  of  tests  developed  using  three  item  analytic  techniques: 
classical  item  analysis,  factor  analysis,  and  the  Rasch  model.   The 
following  five  hypotheses  were  generated  to  compare  the  three  techniques: 

1.  There  are  no  significant  differences  in  the  internal  consistency 
estimates  of  the  tests  produced  by  the  three  methods  as  the  number  of 
items  decreases  when  compared  to  the  projected  internal  consistency 
estimates  for  the  population  for  tests  of  similar  length. 

2. There are no differences in the internal consistency estimates
of the tests produced by the three methods when the number of examinees
is decreased.

3. There are no meaningful^8 differences in the magnitude of
the standard error of measurement of the tests produced by the three
methods.

4. There are no significant differences in the difficulties or
discriminations of the items selected by the three methods.

5. There are no differences across ability levels in the efficiency
of the tests produced by the three methods.

^8 A meaningful difference was previously defined to be ≥ 1.00.

The Verbal Aptitude subtest of the Florida Twelfth Grade Test was
used to test the hypotheses. A sample of 5,235 examinees was
systematically selected from a population of 78,751. A demographic
breakdown of the sample by ethnic origin and sex is presented in Table 4.

The data were analyzed and reported in the following manner: item
selection, double cross-validation, comparison of the 15 item tests on
precision, and comparison of the 30 item tests on precision and efficiency.
These results were then summarized with respect to the five hypotheses.

Item  Selection 

The  50  items  on  the  Verbal  Aptitude  subtest  were  submitted  to  each 
of the three item analytic techniques. The means, medians, and standard
deviations of the biserial correlations and item difficulties, based
on classical item analysis, are presented in Table 5. These descriptive
statistics appear equivalent across the varying sample sizes.

From the factor analysis, the percentage of total test variance
accounted for by the 50 verbal items on the first unrotated principal
component has been reported in Table 6. The percentage of variance
accounted for by the first principal component was obtained by summing
the squared item loadings and dividing by the total number of items.
The percentages of variance accounted for by the first principal component
in each sample were very similar. A check on the unidimensionality
of the test was made by rotating the principal components solution for
the sample of 995 subjects. Upon rotation, the results indicated
that one dominant factor remained.

Three statistics are reported in Table 7 for the Rasch item
analysis procedure. For each sample size, the percentage of total
variance accounted for by the first unrotated principal component
and the means and standard deviations of the mean square fit statistic
and Rasch difficulties are presented for the items selected.
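
The percentage of variance attributed to the first principal component is described in the text as the sum of the squared item loadings divided by the number of items. A minimal sketch of that arithmetic follows; the function name and the loadings are hypothetical and do not reproduce the values in Tables 6 or 7.

```python
def percent_variance_first_component(loadings):
    """Percentage of total test variance accounted for by one component:
    the sum of squared item loadings divided by the number of items, times 100."""
    if not loadings:
        raise ValueError("at least one loading is required")
    return 100.0 * sum(a * a for a in loadings) / len(loadings)


# Hypothetical loadings for a four item test (not data from this study).
hypothetical_loadings = [0.5, 0.5, 0.5, 0.5]
print(percent_variance_first_component(hypothetical_loadings))
```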

In  order  to  select  the  best  15  and  30  items  from  the  Rasch  analysis, 
all 50 items were submitted to a principal components solution. This
procedure was used to ensure that the items selected measured one
trait, as required by the assumption of test unidimensionality. As
noted in Table 7, the percentage of total test variance accounted for
by the first principal component, based on 50 items, was nearly equal
for each sample size.

From the principal components solution, only items with loadings
between .39 and .79 were selected for the Rasch analysis, as suggested
by Hambleton and Traub (1971), to adhere to the assumption of equal
item discriminations. Using this procedure, the number of items (out
of 50) retained for the Rasch analysis varied slightly with sample
size: when N = 250, 33 items were retained; when N = 500, 35 items
were retained; and when N = 995, 33 items were retained. These items,
loading between .39 and .79, were then submitted to the Rasch analysis
to obtain mean square fit statistics and Rasch item difficulties.
These statistics have been reported in Table 7.

Wright and Panchapakesan (1969) developed a measure to assess the
fit of the item to the Rasch model. The measure, defined as the mean
square fit statistic, is:

$$\chi^2 = \sum_{i=1}^{k-1} \sum_{j=1}^{n} y_{ij}^2 \qquad (14)$$

The quantity $\chi^2$ defined above has approximately the chi square
distribution with degrees of freedom equal to (k-1)(n-1). The
value $y_{ij}$ is the deviation of the item from the model, or item misfit,
and is determined by taking the difference between the observed and
expected frequency of the examinees at a given ability level who
answered a given item correctly. This difference was then divided
by the standard deviation of the observed frequency, squared, and
summed over items and score groups. The BICAL program standardizes
these deviations ($y_{ij}$) in computing the mean square fit statistic;
therefore, $y_{ij}$ has a normal distribution with a mean of zero and
standard deviation of one (Hambleton et al., 1977). Items with large
mean square fit values are items which do not fit the model. As shown
in Table 7, the mean and standard deviation of the mean square fit
statistic increased with sample size.
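
The sum of squared standardized deviations in equation 14 can be sketched in a few lines. This is not the BICAL computation; it is a minimal illustration that assumes the expected frequency in a score group is the group size times the Rasch probability of a correct response, and that the standard deviation is the binomial one. All names and values are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Rasch model probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def fit_chi_square(observed, group_sizes, abilities, difficulties):
    """Sum of squared standardized deviations y_ij over score groups i and items j.

    observed[i][j] : examinees in score group i who answered item j correctly
    group_sizes[i] : number of examinees in score group i
    abilities[i]   : ability estimate for score group i
    difficulties[j]: difficulty estimate for item j
    """
    total = 0.0
    for i, (r_i, b_i) in enumerate(zip(group_sizes, abilities)):
        for j, d_j in enumerate(difficulties):
            p = rasch_probability(b_i, d_j)
            expected = r_i * p
            sd = math.sqrt(r_i * p * (1.0 - p))  # binomial SD (an assumption)
            y_ij = (observed[i][j] - expected) / sd
            total += y_ij ** 2
    return total


# Hypothetical data: two score groups of 100 examinees and one item.
chi_square = fit_chi_square(
    observed=[[50], [60]],
    group_sizes=[100, 100],
    abilities=[0.0, 0.0],
    difficulties=[0.0],
)
print(chi_square)
```

Items whose observed frequencies sit close to the model's expectations contribute little to the sum, which is why low values were taken to indicate good fit.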

The  item  difficulty  estimates  based  on  the  Rasch  model  also  have 
an  expected  mean  of  zero  and  standard  deviation  of  one  (Wright  and 
Mead,  1975).   These  estimates  remained  very  similar  across  sample 
size  and  exceptionally  close  to  the  expected  values  (Table  7). 

The Rasch model does not provide a parameter for item discriminating
power, as all item discriminations are considered equal and centered
at one (Wright and Mead, 1975). The BICAL program provided, as part
of the normal output, estimates of the item's discriminating power
to check the fit of the data to the model. The item discriminations
were obtained by regressing the difficulty of the item for each ability
group on the ability estimate of the group (Wright and Mead, 1975, p. 11).
The means and standard deviations for the item discrimination estimates
are shown in Table 8 for each sample size.

TABLE 8

DESCRIPTIVE DATA ON ITEM DISCRIMINATION ESTIMATES
BASED ON THE RASCH MODEL ACCORDING
TO SAMPLE SIZE

                      N = 250^a    N = 500    N = 995
                      K = 33^b     K = 35     K = 33

Mean                  1.03         1.02       1.03
Standard Deviation     .28          .19        .22

^a N = sample size
^b K = number of items


From  the  data  in  Table  8,  the  mean  item  discrimination  estimates 
appear  nearly  equal  for  each  sample  size,  and  quite  close  to  the  mean 
expected  value  of  one. 

The  best  15  and  30  items  were  then  selected  by  each  item  analytic 
procedure  based  on  the  information  in  Tables  5-7,  and  have  been  listed 
in  Tables  9  and  10  respectively. 

The items selected under classical item analysis were determined by
the magnitude of the biserial correlation; that is, the 15 and 30 items
having the highest biserial correlations with total test score were
selected. Indices of item difficulty have been reported for inspection,
but in no way influenced the selection of items for classical item
analysis.

The selection of items under factor analysis was determined by
the item loadings on the first unrotated principal component. The
15 and 30 items having the highest item-component biserial
correlations were selected.

The selection of the 15 and 30 items from the Rasch analysis was
determined by the mean square fit of the item to the Rasch model. The
closer the mean square fit was to zero, the better the item fit the
model; thus items with the lowest mean square fit statistic were
selected.
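
All three selection rules amount to ranking the items on one statistic and keeping the best 15 or 30. The sketch below illustrates that ranking step only; the function name, item labels, and values are hypothetical and are not drawn from Tables 9 or 10.

```python
def select_items(statistic_by_item, k, highest=True):
    """Return the k item identifiers with the best value of a selection statistic.

    highest=True  keeps the largest values (biserial correlations, loadings);
    highest=False keeps the smallest values (mean square fit).
    """
    if k > len(statistic_by_item):
        raise ValueError("cannot select more items than are available")
    ranked = sorted(statistic_by_item, key=statistic_by_item.get, reverse=highest)
    return ranked[:k]


# Hypothetical values for five items (not data from this study).
biserials = {"item1": 0.62, "item2": 0.41, "item3": 0.70, "item4": 0.55, "item5": 0.33}
mean_square_fit = {"item1": 1.4, "item2": 0.6, "item3": 2.1, "item4": 0.9, "item5": 1.1}

best_classical = select_items(biserials, k=2, highest=True)
best_rasch = select_items(mean_square_fit, k=2, highest=False)
print(best_classical, best_rasch)
```

Because the two rules rank on different statistics, the same item pool can yield different subsets, which is the contrast the study examines.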

Double  Cross-Validation 
After the tests of the best 15 and 30 items were developed by each
procedure, they were scored on independent samples in a double
cross-validation procedure, as noted in Table 3, Chapter III. Item and
test statistics needed to test the five hypotheses were obtained for
the 15 and 30 item tests based on the cross-validation samples using the
GITAP program (Baker and Martin, 1970).

The GITAP program provided the following output:
.   each subject's total test score
.   test mean and standard deviation
.   internal consistency estimates as measured by Hoyt's
    analysis of variance procedure
.   estimates of the standard error of measurement
.   indices of item difficulty and biserial correlations

Comparison  of  the  15  Item  Tests  on  Precision 
The descriptive statistics based on the double cross-validation
samples for the 15 item tests have been presented in Table 11.

The values of the internal consistency estimates for the tests
developed using the Rasch model were consistently lower than the
internal consistency estimates of the tests developed by classical
item analysis and factor analysis across all sample sizes.

The observed internal consistency estimates were tested for
significance using confidence intervals described by Feldt (1965) to
see if they were statistically different from the population internal
consistency estimate projected with the Spearman-Brown Prophecy
Formula.

The  internal  consistency  estimate  for  the  population  based  on  the 
original  50  item  subtest  was  .88  (Table  1).   By  applying  the 
Spearman-Brown  Prophecy  Formula  (Mehrens  and  Lehman,  1975)  the  projected 
population  internal  consistency  estimate  for  a  15  item  test  was  found 
to be .687. The value .687 was the expected internal consistency if 35
of the 50 items were randomly deleted. Thus, confidence intervals
were generated around the observed internal consistency estimates,
presented in Table 11, for each procedure across all sample sizes to
see if any of the three item analytic techniques would produce a more
reliable test than would be expected from mere random item deletion.
The confidence intervals for the observed consistency estimates
for each procedure have been reported in Table 12.
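
The projected population values can be checked with the standard Spearman-Brown formula, in which the length factor is the ratio of the new test length to the original length. The sketch below is a minimal illustration with a hypothetical function name; it is not the procedure or program used in the study.

```python
def spearman_brown(reliability, original_length, new_length):
    """Projected reliability when a test is shortened or lengthened.

    Uses r' = k * r / (1 + (k - 1) * r), where k = new_length / original_length.
    """
    k = new_length / original_length
    return k * reliability / (1.0 + (k - 1.0) * reliability)


# Population estimate of .88 for the 50 item subtest, projected to 15 and 30 items.
projected_15 = spearman_brown(0.88, 50, 15)
projected_30 = spearman_brown(0.88, 50, 30)
print(round(projected_15, 4), round(projected_30, 4))
```

The results are about 0.6875 and 0.8148, in line with the projected values of .687 and .814 used for the 15 and 30 item tests.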

When the sample sizes were 250 and 995, each item analytic technique
produced an internal consistency estimate that was significantly different
from the projected population estimate (.687) at a confidence level of
95 percent. Each of the three techniques systematically retained
the 15 most homogeneous items. These tests were more precise
in terms of internal consistency than would have been found if the
items were randomly deleted, as noted by comparisons to the projected
population reliability coefficient.

TABLE 12

CONFIDENCE INTERVALS^a FOR THE OBSERVED INTERNAL
CONSISTENCY ESTIMATES BASED ON THE 15
ITEM TESTS ACCORDING TO SAMPLE SIZE

                         95% Confidence Interval
Procedure           N = 250        N = 500        N = 995

Classical           .748-.828*     .792-.838*     .810-.841*
Factor Analysis     .760-.837*     .786-.854*     .797-.831*
Rasch               .704-.799*     .688-.757      .711-.759*

^a The F values used in calculating the confidence intervals were obtained
from Marisculo (1971).

* Statistical significance is indicated when the population internal
consistency estimate is not included in the confidence interval
generated for each observed internal consistency estimate. The
projected population value was .687.


Only two procedures, classical item analysis and factor analysis,
produced tests with internal consistency estimates significantly
different from the projected population estimate when the sample size
was 500.

As sample size decreased, in most cases, the internal consistency
for each method tended to decrease (Table 11). An exception was noted
for the Rasch tests: when the sample size decreased from 500 to 250,
internal consistency improved slightly.

The data reported in Table 11 indicated that the standard errors
of measurement for the 15 item tests based on the Rasch model were
consistently larger than the standard errors of measurement of the
tests developed from classical item analysis and factor analysis for each
sample size. However, these differences were not meaningful in that
the difference did not equal or exceed 1.00 for any of the three
procedures.

The  differences  in  mean  item  difficulties  and  discriminations 
were  tested  for  statistical  significance  to  determine  whether  there  were 
differences in the types of items retained by each item analytic method.
In this study item discriminations were measured by biserial correlations.
A two-way analysis of variance (fixed effects model) was performed
separately for the two dependent variables of item difficulty and item
discrimination. A check was made on the assumptions for the analysis
of variance to ensure that they were met. In these analyses, item
analytic  technique  and  sample  size  were  the  two  independent  factors,  each 
with  three  levels. 

For item difficulty, no significant differences were found for
item analytic technique, sample size, or their interaction, F (2,126) =
2.57, p > .05; F (2,126) = .45, p > .05; F (4,126) = .33, p > .05,
respectively.
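
The degrees of freedom reported for these F ratios are consistent with a balanced 3 x 3 design in which every cell contributes one observation per selected item. That reading is an assumption made here for illustration; the sketch below, with a hypothetical function name, simply reproduces the arithmetic.

```python
def two_way_anova_df(levels_a, levels_b, n_per_cell):
    """Degrees of freedom for a balanced two-way fixed effects analysis of variance."""
    cells = levels_a * levels_b
    total_n = cells * n_per_cell
    return {
        "factor_a": levels_a - 1,
        "factor_b": levels_b - 1,
        "interaction": (levels_a - 1) * (levels_b - 1),
        "error": total_n - cells,
    }


# Three techniques by three sample sizes, 15 items per cell (assumed layout).
df_15 = two_way_anova_df(levels_a=3, levels_b=3, n_per_cell=15)
# The same layout with 30 items per cell.
df_30 = two_way_anova_df(levels_a=3, levels_b=3, n_per_cell=30)
print(df_15, df_30)
```

Under this assumption the error term has 126 degrees of freedom for the 15 item tests and 261 for the 30 item tests, matching the values reported with the F ratios.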

The  means,  standard  deviations,  and  ranges  of  the  item  difficulties 
based  upon  the  15  item  tests  have  been  reported  in  Table  13. 

For the analysis of variance performed on the transformed
biserial correlations, a significant F ratio was observed for the factor
of item analytic technique, F (2,126) = 14.862, p < .05. No significant
differences were observed for sample size or the interaction of sample
size and item analytic technique for the transformed biserial
correlations, F (2,126) = .30, p > .05; F (4,126) = 1.16, p > .05,
respectively. The means, standard deviations, and ranges of the
transformed biserial correlations, based upon the 15 item tests, have
been presented in Table 14.

TABLE 13

15 ITEM TESTS:
DESCRIPTIVE STATISTICS FOR ITEM DIFFICULTY
BY PROCEDURE AND SAMPLE SIZE

                              Procedure                      Sample Size
                                  Factor
                      Classical   Analysis   Rasch      250       500       995

Mean                  .65         .67        .60        .66       .63       .63
Standard Deviation    .15         .14        .16        .15       .15       .16
Range                 .31-.91     .35-.92    .27-.90    .31-.92   .52-.92   .27-.91

Post hoc comparisons were made to determine which of the three
item analytic procedures, based upon their means, contributed to the
significant F ratio for the transformed biserial correlations. Tukey's
HSD (honestly significant difference) test for multiple comparisons
was employed (Kirk, 1968, p. 88). The HSD value (α = .01) was 6.88.
Therefore, a difference between means had to exceed this value to be
significantly different. The results of the post hoc comparisons
between the mean item discriminations have been reported in Table 15.

^9 When the actual biserial correlations were tested in the two-way
analysis of variance design, similar F ratios were observed.


TABLE 14

15 ITEM TESTS:
DESCRIPTIVE STATISTICS FOR ITEM DISCRIMINATIONS^a
BY PROCEDURE AND SAMPLE SIZE

                              Procedure                      Sample Size
                                  Factor
                      Classical   Analysis   Rasch      250       500       995

Mean                  53.60       54.49      43.51      51.20     49.47     50.98
Standard Deviation    12.18       11.98       8.21      14.17     11.20     10.36
Range                 29-82       34-91      28-64      34-91     28-78     30-73

^a Based on transformed biserial correlations. The transformation was a
linear transformation of the Fisher z statistic and multiplication
by the constant 60.241, providing a range of 0-100 for the biserial
correlation (Davis, 1946).


From  Table  15,  it  is  apparent  that  the  mean  transformed  biserial 
correlation  from  the  Rasch  developed  test  was  significantly  lower 
than  the  mean  biserial  correlations  from  the  tests  developed  by 
classical  item  analysis  and  factor  analysis. 

Comparison  of  the  30  Item  Tests  on  Precision 

The descriptive statistics based on the double cross-validation
of the 30 items selected by each procedure, according to sample size,
have been presented in Table 16.

TABLE 15

POST HOC COMPARISONS OF THE DIFFERENCES
BETWEEN THE MEAN ITEM DISCRIMINATIONS^a
FOR THE 15 ITEM TESTS

                                             Means
                                     54.49     53.60     43.51

Factor Analysis (54.49)                         .89      10.98**
Classical Item Analysis (53.60)                          10.09**
Rasch Analysis (43.51)

^a Based on transformed biserial correlations. The transformation was a
linear transformation of the Fisher z statistic and multiplication
by the constant 60.241, providing a range of 0-100 for the biserial
correlation (Davis, 1946).

** p < .01, HSD = 6.88.
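
The post hoc decision rule is a direct comparison of each pairwise difference in means against the HSD value. The sketch below applies that rule to the three means shown in Table 15; the function name is hypothetical, and the HSD value of 6.88 is taken from the text rather than recomputed.

```python
from itertools import combinations


def pairwise_hsd(means, hsd):
    """Compare every pair of means and flag differences that exceed the HSD value."""
    results = []
    for (name_a, mean_a), (name_b, mean_b) in combinations(means.items(), 2):
        difference = abs(mean_a - mean_b)
        results.append((name_a, name_b, round(difference, 2), difference > hsd))
    return results


# Mean transformed biserial correlations for the 15 item tests (Table 15).
means_15 = {"Factor Analysis": 54.49, "Classical": 53.60, "Rasch": 43.51}
comparisons = pairwise_hsd(means_15, hsd=6.88)
for row in comparisons:
    print(row)
```

Only the two comparisons involving the Rasch test exceed the HSD value; the difference between factor analysis and classical item analysis does not.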

By  increasing  the  test  length  to  30  items,  the  internal  consistency 
estimate  was  increased  across  each  method  and  sample  size,  but  a  pattern 
similar  to  that  for  the  15  item  test  emerged.   The  internal  consistency 
estimates  from  the  test  based  on  the  Rasch  model  were  slightly  lower 
than  the  internal  consistency  estimates  for  the  tests  based  on 
classical  item  analysis  and  factor  analysis.   The  observed  internal 
consistency  estimates  were  tested  for  significance,  using  the  confidence 
intervals  described  in  the  previous  section,  to  see  if  they  were 
statistically  different  from  the  internal  consistency  estimate  for  the 
population. 

The projected population internal consistency estimate for a 30
item test was found to be .814 (via the Spearman-Brown Prophecy Formula).
The value of .814 indicated the expected internal consistency
if 20 of the 50 items were randomly deleted.

Based on the observed internal consistency estimates reported
in Table 16, confidence intervals were generated for each item analytic
procedure and have been presented in Table 17.

TABLE 17

CONFIDENCE INTERVALS^a FOR THE OBSERVED INTERNAL
CONSISTENCY ESTIMATES BASED ON THE 30
ITEM TESTS ACCORDING TO SAMPLE SIZE

                         95% Confidence Interval
Procedure           N = 250        N = 500        N = 995

Classical           .800-.865      .845-.880*     .850-.874*
Factor Analysis     .825-.881*     .834-.871*     .848-.873*
Rasch               .794-.860      .817-.858*     .838-.863*

^a The F values used in calculating the confidence intervals were obtained
from Marisculo (1971).

* Statistical significance is observed when the population internal
consistency estimate is not included in the confidence interval
generated for each observed internal consistency estimate. The
projected population value was .814.


For  the  sample  of  250  examinees,  only  one  item  analytic  technique 
(factor  analysis)  produced  an  internal  consistency  estimate  that  was 
statistically  different  from  the  projected  population  estimate  at  a 
confidence  level  of  95  percent. 

However, all three techniques produced tests with internal
consistency estimates significantly different from the projected population
estimate when the sample was increased to 500 and 995. Thus, when
the number of examinees was large, each of the three techniques
produced tests with higher internal consistency estimates than if the
test were produced by randomly deleting items.

For the 30 item tests, the effect of decreasing the sample size
tended to decrease internal consistency for each method (Table 16),
but the decrease was very slight.

The standard error of measurement was essentially the same for
the three methods of item analysis across the varying sample sizes.

Two-way analyses of variance were run on item difficulties and
item discriminations for the 30 item tests, similar to those run for
the 15 item tests. Again, the independent variables were item analytic
technique and sample size, each containing three levels.

No significant differences were observed for item difficulty
for the independent variables of item analytic technique, sample size,
or their interaction, F (2,261) = .46, p > .05; F (2,261) = .27, p > .05;
F (4,261) = .24, p > .05, respectively.

No significant differences were observed for the transformed
biserial correlations for the independent variables of item analytic
technique, sample size, or their interaction, F (2,261) = 1.97, p > .05;
F (2,261) = .74, p > .05; F (4,261) = .48, p > .05, respectively.


When the actual biserial correlations were tested in the two-way
analysis of variance design, similar F values were observed.

The means, standard deviations, and ranges of the item difficulties
and transformed biserial correlations based upon the 30 item tests
have been presented in Tables 18 and 19 respectively.

TABLE 18

30 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM
DIFFICULTY BY PROCEDURE AND SAMPLE SIZE

                              Procedure                      Sample Size
                                  Factor
                      Classical   Analysis   Rasch      250       500       995

Mean                  .58         .60        .59        .58       .60       .59
Standard Deviation    .17         .16        .17        .18       .16       .16
Range                 .21-.91     .29-.92    .23-.90    .23-.92   .24-.92   .21-.91

TABLE 19

30 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM
DISCRIMINATIONS^a BY PROCEDURE AND SAMPLE SIZE

                              Procedure                      Sample Size
                                  Factor
                      Classical   Analysis   Rasch      250       500       995

Mean                  41.02       41.61      38.56      39.42     40.37     41.40
Standard Deviation    11.82       10.85       9.93      12.45      9.87     10.34
Range                 9-70        19-68      13-64      9-70      21-68     24-66

^a Based on transformed biserial correlations. The transformation was a
linear transformation of the Fisher z statistic and multiplication
by the constant 60.241, providing a range of 0-100 for the biserial
correlation (Davis, 1946).

Comparison of the 30 Item Tests on Efficiency
Lord (1974a, 1974b) proposed the formula used for approximating
the relative efficiency of two tests, stated previously in equation
13 as:

$$\mathrm{R.E.}(y,x) = \frac{n_y}{n_x} \cdot \frac{x\,(n_x - x)\,f_x^2}{y\,(n_y - y)\,f_y^2}$$

where R.E. (y,x) denotes the relative efficiency of y compared to x,
$n_x$ and $n_y$ are the numbers of items in the two tests, x and y are the
number-right scores having the same percentile rank, and $f_x^2$ and
$f_y^2$ are the squared observed frequencies of x and y obtained from
frequency distributions for similar groups of examinees. A careful
examination of the formula for relative efficiency indicated that when
$n_x = n_y$ and x = y, it was the number of examinees at the specified
ability level ($f_x^2$ and $f_y^2$) that determined the efficiency of the
test. That is, the fewer examinees observed at a particular percentile
rank, the better the test discriminates at that percentile rank.
Therefore, test efficiency was equated with the level of discrimination
the test was able to make between examinees at various scores or
percentile ranks.
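
The reduction described in this paragraph can be written out as one worked step. The lines below use the same symbols as the relative efficiency formula and add nothing beyond the substitution of equal test lengths and equal scores.

```latex
\mathrm{R.E.}(y,x)
  = \frac{n_y}{n_x}\cdot\frac{x\,(n_x - x)\,f_x^2}{y\,(n_y - y)\,f_y^2}
  \;\xrightarrow{\;n_x = n_y,\; x = y\;}\;
  \frac{f_x^2}{f_y^2}
```

Under those two conditions R.E.(y,x) exceeds 1.00 exactly when $f_y < f_x$, that is, when fewer examinees share the matched score on test y than on test x.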

Three  relative  efficiency  comparisons  were  made  using  the  30  item 
tests  based  on  the  sample  of  995  examinees.   The  three  comparisons  were: 
(a) the test developed from factor analysis was compared to the test
developed by classical item analysis, (b) the test developed by Rasch
analysis was compared to the test developed by classical item analysis,
and (c) the test developed from the Rasch analysis was compared to the
factor analytically developed test.


The efficiency curves for the three comparisons are shown in
Figure 2. The relative efficiency value was plotted on the ordinate,
while the percentile rank (student ability level) was plotted
along the abscissa. Computed values for the relative efficiency
comparisons have been reported in Appendix B. A relative efficiency
of 1.00 would indicate that the tests are equally efficient.

The test developed by factor analysis was more efficient for the
lower tenth of the pupils when compared to the test developed from
classical item analysis. Both tests were about equally efficient
for the middle ability groups and high ability groups.

The Rasch developed test was more efficient than the test based
on classical item analysis for average to high ability students
(40th-90th percentile rank). However, it was less efficient than the
classical item analysis test for students with very low or very high
abilities (1st-20th percentile rank and 98th percentile rank).

When compared to the factorially developed test, the Rasch test
was again more efficient for students of average to high abilities
(50th-90th percentile rank). The factorially developed test appeared
more efficient for the very low and very high ability students
(1st-20th percentile rank and 98th percentile rank).

[FIGURE 2. Relative efficiency comparisons for the three 30 item tests,
N = 995. Surviving panel title: Rasch analysis compared to factor analysis.]

Summary

The results reported in this chapter are summarized for each of
the five hypotheses.

Hypothesis 1. There are no significant differences in the
internal consistency estimates of the tests produced by the three
methods, as the number of items decreases, when compared to the
projected internal consistency estimates for the population for tests
of similar length.

Confidence  intervals  were  calculated  to  test  for  differences  between 
the  observed  internal  consistency  estimates  and  the  internal  consistency 
estimate for the population. As reported in Tables 12 and 17, for the
15 and 30 item tests, 15 of the 18 confidence intervals (at the 95 percent
level) generated around the sample estimate did not contain the population
value. This means that 15 of the observed internal consistency estimates
were superior to the population values projected for subtests of similar
length created by random deletion of items. Therefore, hypothesis one
was not supported. The procedures that produced the three observed
internal consistency estimates that were not significantly different
from the population value, and hence no different than would be expected
by random item deletion, were the Rasch procedure (15 item test, N = 500;
30 item test, N = 250) and the classical item analysis procedure (30
item test, N = 250).

Hypothesis  2.   There  are  no  differences  in  the  internal  consistency 
estimates  of  the  tests  produced  by  the  three  methods  when  the  number 
of  examinees  is  decreased. 

Hypothesis two was supported for the 15 and 30 item tests. Slight
decreases in internal consistency estimates were noted for the 15
item test (Table 11) as sample size decreased, but only decreases of
one or two one-hundredths of a point. Even smaller decreases were
observed on the 30 item test (Table 16).

Hypothesis 3. There are no meaningful differences in the
magnitude of the standard error of measurement of the tests produced
by the three methods.

Hypothesis three was supported for the 15 and 30 item tests.
Meaningful differences were defined to be ≥ 1.00, but none of the
three methods produced tests with standard errors of measurement that
differed by that much. In each case, the difference was approximately
one-tenth of a point or less (Tables 11 and 16).

Hypothesis 4. There are no differences in the difficulties or
discriminations  of  the  items  selected  by  the  three  methods. 

Hypothesis  four  was  supported  for  the  15  and  30  item  tests  with 
respect  to  item  difficulty.   That  is,  the  two-way  analysis  of  variance 
revealed  no  significant  differences  for  either  the  15  or  30  item  tests 
with  regard  to  item  difficulty. 

Hypothesis four was also supported for item discrimination, but
only for the 30 item tests. The two-way analysis of variance for item
discrimination indicated no significant differences for the 30 item
tests; however, on the 15 item tests, a significant F ratio (p < .05)
for item analytic procedure was observed for item discrimination.
Tukey's HSD test revealed that items selected by the Rasch procedure
had significantly lower average biserial correlations than the items
selected by factor analysis and classical item analysis (Table 15).
This could have been expected because the range of the biserial
correlation was restricted when the items were originally selected
for the Rasch model. This procedure was necessary to meet one of the
assumptions for the Rasch model.

Hypothesis 5. There are no differences across ability levels in
the efficiency of the tests produced by the three methods.

Hypothesis five was not supported. The efficiency curves
illustrated in Figure 2 generally indicated that the tests based on
classical test theory were more effective for measuring students with
very low ability (20th percentile rank or less) and students with very
high ability (98th percentile rank). The Rasch developed test was
most efficient for assessing average and high ability students (40th-
90th percentile rank).


CHAPTER  V 

DISCUSSION  AND  CONCLUSIONS 

This study was conducted to determine which of the three item
analytic procedures (classical item analysis, factor analysis, and the
Rasch model) might produce the superior test in terms of the precision
and the efficiency of measurement. A common item and examinee population
was used to test five hypotheses. Of the five hypotheses, three dealt
with elements of test precision as measured by internal consistency
estimates. Another hypothesis treated the issue of item discriminations.
Thus, it too was related to internal consistency. The fifth hypothesis
focused on the relative efficiency of the tests produced by the three item
analytic techniques. This hypothesis altered the emphasis of the study
from one overall specific measure of a test's accuracy, in terms of
internal consistency, to a general comparison of each method as a
function of ability level. The discussion of the results has therefore
been focused in two major areas: (a) the precision of the tests, and (b)
the efficiency of the tests produced by the three methods of item analysis.

The  Precision  of  the  Tests  Produced  by  the 
Three  Methods  of  Item  Analysis 

Each of the three item analytic techniques was applied to an
independent sample to select the best 15 and 30 items. The stability of
the summary statistics across each sample size for the three item analytic
techniques indicated a tendency for the nine samples to be very homogeneous.
The similarity of the means, standard deviations, and percentages of
variance accounted for was noted in Tables 5-8, with the exception
of the mean square fit statistic (Table 7), which increased with
sample size. (This exception is discussed later in this chapter.)
From these samples, items were selected by each item analytic technique
to maximize internal consistency.

The  data  reported  in  Tables  11  and  16  indicated  the  effectiveness 
of  each  item  analytic  technique  in  producing  internally  consistent 
tests.   Before  an  overall  decision  can  be  made  as  to  the  superiority  of 
one technique over another, each of the hypotheses relating to precision
must be considered.

Internal Consistency

Data in Tables 11 and 16 indicate that the two tests based on
classical test theory (factor analysis and classical item analysis)
appeared superior in terms of internal consistency when compared to the
tests developed by the Rasch model.
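
The internal consistency estimates discussed here can be illustrated with a short computation. The sketch below assumes the Kuder-Richardson formula 20 for dichotomously scored items. This section does not name the coefficient that was used, so that choice, the function name kr20, and the small response matrix are illustrative assumptions and not the study's data.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for a matrix of 0/1 item scores.

    responses: list of rows, one row per examinee, one column per item.
    Assumption: KR-20 stands in for the study's internal consistency
    estimate, which this section does not specify.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")

    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    # Sum of item variances p * q, where p is the proportion correct.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_examinees
        pq_sum += p * (1.0 - p)

    return (n_items / (n_items - 1)) * (1.0 - pq_sum / total_variance)


# Hypothetical responses: 6 examinees by 3 items.
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
print(round(kr20(example), 4))
```

Selecting items that correlate highly with the total score raises the total-score variance relative to the sum of item variances, which is why each technique's selection tends to raise a coefficient of this form.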

To  test  whether  any  of  the  three  methods  produced  tests  with 
greater internal consistency than a test created by random item deletion,
the internal consistency estimates were compared to the projected internal
consistency value for the population by using confidence intervals as
suggested by Feldt (1965). In order for a given sample internal
consistency estimate to be significant, the population value could not be
included in the confidence interval generated around that sample value.
For the 15 item tests, nine confidence intervals were calculated for
the nine estimates of internal consistency, one for each method at each
sample size. Eight of the nine sample values were shown to be significantly
greater than the population estimate at the 95 percent confidence level
(Table 12). Only the internal consistency estimate of the Rasch test,
based on the sample of 500 examinees, failed to reach a level significantly
greater than would have been expected by chance.

For the 30 item tests, nine confidence intervals were also calculated
for the nine estimates of internal consistency, one for each method
at each sample size. Seven of the nine sample internal consistency
estimates were shown to be significantly greater than the population
estimate at the 95 percent confidence level (Table 17). The tests
based on classical item analysis and Rasch analysis, for the sample of
250 examinees, were not significantly different from the projected
population value for a 30 item test created by random item selection.
Therefore, for smaller samples (N = 250), factor analysis appeared
to be superior to classical item analysis and the Rasch analysis in
producing the most precise test.
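
The projected population value used in these comparisons is described in the summary chapter as coming from the Spearman-Brown prophecy formula. A minimal sketch of that projection follows. The full-test reliability of .90 and the pool of 60 items are hypothetical placeholders, since neither figure appears in this section.

```python
def spearman_brown(reliability, length_ratio):
    """Projected reliability when test length is multiplied by length_ratio.

    A length_ratio below 1 corresponds to shortening the test, as when
    items are deleted at random from a longer pool.
    """
    return (length_ratio * reliability) / (
        1.0 + (length_ratio - 1.0) * reliability
    )


# Hypothetical inputs, not taken from the study.
full_reliability = 0.90
pool_size = 60

for k in (15, 30):
    projected = spearman_brown(full_reliability, k / pool_size)
    print(f"{k} items: projected reliability = {projected:.3f}")
```

An observed estimate whose confidence interval lies entirely above the projected value indicates that the selected items did better than random deletion to the same length would be expected to do.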

Generally, as the number of examinees decreased so did the internal
consistency estimates. However, the tests based on factor analysis were
least affected by decreasing the sample sizes used in the cross-validation
for the 15 and 30 item tests (Tables 11 and 16).

Standard Error of Measurement

The  standard  error  of  measurement  is  the  standard  deviation  of  the 
distribution  of  errors  surrounding  an  individual's  observed  score  on 
an  infinite  number  of  parallel  tests.   Hence  the  smaller  the  standard 
error  of  measurement,  the  greater  the  precision  of  the  measurement.   This 
statistic  is  often  considered  a  more  meaningful  measure  of  an  instrument's 
reliability than the reliability coefficient itself (Magnusson, 1966,
p. 82). Based on the data for this study, the standard errors of
measurement were consistently smaller for both the 15 and 30 item tests
produced by classical test theory as compared to the 15 and 30 item
tests based on the Rasch model; however, the differences in the standard
errors of measurement did not equal or exceed 1.00 for any of the
methods.
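
The relation between the standard error of measurement and reliability can be made concrete with the usual classical formula, in which the standard error equals the score standard deviation times the square root of one minus the reliability. The sketch below assumes that formula; the standard deviation and reliabilities are hypothetical values chosen only to show the size of the effect.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical standard error of measurement: SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


# Hypothetical values, not taken from Tables 11 and 16.
score_sd = 5.0
for reliability in (0.84, 0.86):
    sem = standard_error_of_measurement(score_sd, reliability)
    print(f"reliability {reliability:.2f}: SEM = {sem:.2f}")
```

Under these assumed values, a change of a few hundredths in reliability moves the standard error by roughly a tenth of a score point, which is the order of difference reported for the three methods.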

Types  of  Items  Retained 

Item difficulty. Item difficulties of the 15 and 30 item tests
were analyzed in separate two-way analyses of variance. The two
independent variables were sample size and item analytic technique.
No significant F ratios were observed for either the 15 or 30 item
tests on item difficulty. Therefore, each item analytic technique
tended to select items which had similar item difficulties on the
average.

Item discrimination. In this study, the item discriminations were
measured by biserial correlation. Transformed biserial correlations
for the 15 and 30 item tests were analyzed in separate two-way analyses
of variance. The two independent variables were sample size and item
analytic technique. For the 15 item tests, a significant F ratio
(p < .05) was observed for the main effect of item analytic technique.
The mean biserial correlations for the 15 item tests were 44 for the
Rasch test, 54 for the classical item analysis test, and 54 for the
factorially developed test. (The actual mean biserial correlations for
the three tests corresponding to the transformed biserial correlations
were .62, .71, and .71, respectively.) Tukey's HSD post hoc comparison
indicated that the items selected by the Rasch procedure had lower
biserial correlations, on the average, than items selected on the basis
of factor analysis or classical item analysis. It should be noted
that this finding was due to the fact that the range of the biserial
correlations was restricted to .33 - .79 on the items selected for
the Rasch calibration. This was necessary to meet the assumption
of equal item discriminations.
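
The biserial index and the selection of the highest-ranking items can be sketched as follows. The code uses the standard textbook formula for the biserial correlation between a dichotomous item and the total score. The function names and the small response matrix are hypothetical, and the total score here includes the item itself, which is an assumption about the scoring and not a detail given in this section.

```python
from statistics import NormalDist, mean, pstdev


def biserial(item_scores, total_scores):
    """Biserial correlation between a 0/1 item and a continuous total score."""
    passed = [t for x, t in zip(item_scores, total_scores) if x == 1]
    failed = [t for x, t in zip(item_scores, total_scores) if x == 0]
    if not passed or not failed:
        raise ValueError("item must have both correct and incorrect responses")

    p = len(passed) / len(item_scores)
    q = 1.0 - p
    # Ordinate of the standard normal density at the point that splits
    # the distribution into proportions p and q.
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return (mean(passed) - mean(failed)) / pstdev(total_scores) * (p * q / y)


def select_top_items(responses, k):
    """Return the indices of the k items with the highest biserial values."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    values = {
        j: biserial([row[j] for row in responses], totals)
        for j in range(n_items)
    }
    return sorted(values, key=values.get, reverse=True)[:k]


# Hypothetical responses: 6 examinees by 3 items.
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
print(select_top_items(example, 1))
```

Because the selection simply takes the top of a ranked list, restricting the range of admissible biserial values before ranking, as was done for the Rasch calibration, necessarily lowers the mean of the items that remain eligible.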

However, when the test length was increased to 30 items, no
significant F ratios were observed for the variable of item
discrimination. The difference in these two findings for the 15 and 30 item
tests can be explained by the way the 15 and 30 item tests were constructed.
The 15 item test was made up of the 15 items with the highest biserial
correlations. The 30 item test was made up of the above 15 items and
an additional set of 15 items with the next highest biserial correlations.
The addition of 15 more items meant that their average biserial
correlation was something less than the original 15 items. Hence, the mean
biserial correlations were reduced for the longer 30 item tests
(Table 19).

Conclusions

From  the  data  presented  for  each  of  the  four  areas  above,  it  was 
concluded  that  each  of  the  three  item  analytic  techniques  tended  to 
produce  tests  that  were  really  no  different  in  terms  of  the  precision 
of  measurement. 

Thus, the question to consider now is: Should practitioners
in the field of measurement spend their time learning to use the Rasch
model to develop cognitive norm-referenced tests, knowing the extra
work and sophistication of knowledge required to effectively use the
Rasch procedures? (The field of test development is limited here to
cognitive norm-referenced tests because that was the type of instrument
used in this study.) With the criterion of internal consistency as a
measure of test superiority, it appeared from this study that
factorially developing tests or, if computer facilities were not
available, using classical item analysis procedures seems more
than adequate for good test construction.

However,  it  must  be  remembered  that  internal  consistency  may 
not  be  a  fair  and  sufficient  criterion.   Internal  consistency  is  an 
integral  part  of  classical  test  theory  and  may  be  biased  since  it  was 
derived from the classical model. Following that reasoning, Whitely
and Dawis (1974) have commented on the precision of tests developed
using classical item analysis and Rasch analysis. They stated that if
the goal of item selection was to develop fixed-content tests, then
the classical techniques of item selection will yield the more precise
test since precision is specific to the trait distribution in a given
test. Whitely and Dawis indicated that the strength of the Rasch
analysis was in the individualized selection of items, as in tailored
testing, rather than the construction of fixed-content tests.

In  considering  the  above  situation,  Lord  (1974b)  has  stated  that 
internal  consistency  is  an  overall  estimate  of  a  test's  homogeneity, 
but  provides  no  information  on  how  the  test  as  a  whole  discriminates 
for the various ability groups taking the test. Thus, the three
techniques of test development were compared using an additional criterion,
relative efficiency.

The Efficiency of the Tests Produced by the
Three Methods of Item Analysis

The  review  of  the  literature  concerning  the  relative  efficiency 
of a test presented in Chapter II cited studies mainly dealing with
latent  trait  theory  (Birnbaum,  1968;  Hambleton  and  Traub,  1971,  1973). 
Studies  comparing  the  relative  efficiency  of  tests  developed  by  latent 
trait  theory  to  classical  test  theory  appear  to  be  missing  from  the 
literature  on  test  efficiency. 

The  relative  efficiency  formula  (Lord,  1974a,  1974b)  was  not 
derived  for  any  specific  test  development  theory;  therefore,  relative 
efficiency  estimates  should  be  applicable  to  any  test  development 
technique.   Lord  (1974b)  suggested  his  formula  may  not  work  well  for 
extremely  short  tests  and  that  it  should  only  be  used  on  large  samples 
of  examinees.   Thus,  in  the  present  study,  only  the  30  item  tests  were 
compared  using  the  sample  of  995  examinees.   The  three  comparisons  of 
relative  efficiency  were:   (a)  the  test  based  on  factor  analysis  was 
compared to the test based on classical item analysis, (b) the test
based  on  the  Rasch  analysis  was  compared  to  the  test  based  on  classical 
item  analysis,  and  (c)  the  test  based  on  the  Rasch  analysis  was  compared 
to  the  test  based  on  factor  analysis. 

Generally,  the  results  indicated  that  the  Rasch  test  was  superior 
to  the  two  tests  based  on  classical  test  theory  for  students  of  average 
and high ability. The two tests based on classical test theory,
however, were superior in efficiency to the Rasch developed test for very
low and very high ability students (Figure 2). Test efficiency has been
defined as a measure of how well a test is able to discriminate between
examinees of varying abilities. Therefore, the test constructor must
ask himself: For which segment(s) of the examinee population is the
test intended to discriminate?
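
The idea of efficiency as a function of ability can be illustrated with test information functions under the Rasch model, where an item's information at a given ability is P(1 - P) and a test's information is the sum over its items. The sketch below takes relative efficiency to be the ratio of two such test information functions. This is the common information-ratio definition and is offered only as an illustration; it is not necessarily the computation Lord (1974a, 1974b) describes, and the item difficulties are hypothetical.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def item_information(theta, difficulty):
    """Rasch item information at ability theta: P * (1 - P)."""
    p = rasch_probability(theta, difficulty)
    return p * (1.0 - p)


def total_information(theta, difficulties):
    """Test information: item contributions simply add."""
    return sum(item_information(theta, b) for b in difficulties)


def relative_efficiency(theta, difficulties_a, difficulties_b):
    """Ratio of test A's information to test B's at ability theta."""
    return total_information(theta, difficulties_a) / total_information(
        theta, difficulties_b
    )


# Hypothetical tests: A concentrates items near the middle of the
# ability scale, B spreads them toward the extremes.
test_a = [-0.5, 0.0, 0.5]
test_b = [-2.0, 0.0, 2.0]

for theta in (-2.5, 0.0, 2.5):
    ratio = relative_efficiency(theta, test_a, test_b)
    print(f"theta = {theta:+.1f}: relative efficiency A/B = {ratio:.2f}")
```

A ratio above 1 at a given ability means test A discriminates better there. With these assumed difficulties the ratio exceeds 1 near the middle of the scale and falls below 1 at the extremes, so neither test is uniformly superior, which is the kind of crossover pattern described for Figure 2.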

In this study, the test under consideration was a verbal aptitude
college admissions test. Usually college admissions officers are
interested in selecting students who will be successful once admitted
to college. The examinees who score very high on college admissions
tests will generally be admitted to college without any question. Thus,
it is less important to be able to discriminate among the very high
scoring examinees than to discriminate among the students who score near
the mean or in the upper middle range on a college admissions test.
For these students it is difficult to decide who should be admitted
and who should be denied admittance. If it is known that the admissions
test discriminates very well for average to high ability students, then
the reliability of the selection process based on test scores should
be increased. The data in this study, therefore, indicate that the
test based on the Rasch analysis would be most efficient for selecting
the average to high ability students for admission to college.

Lord (1968) has illustrated a very important feature of test
information and relative efficiency curves. He has shown that the
contribution of each item to a test is independent of all other items.
Thus, when information curves are available on a pool of items they can
be added to a test to achieve a prespecified information or relative
efficiency curve for any subpopulation of examinees (Lord, 1968).
Therefore, by using measures such as relative efficiency and information
curves psychometricians are able to develop very discriminating tests
for any segment of the population.

Conclusions

It  has  been  suggested  that  because  efficiency  and  test  information 
curves are a function of ability, these estimates ought to replace
the  use  of  classical  reliability  estimates  and  the  standard  error  of 
measurement  in  test  score  information  (Hambleton  et  al.,  1977).   This 
suggestion  certainly  deserves  some  consideration  in  light  of  the  present 
study  where  it  was  shown  that  the  three  methods  of  item  analysis 
produced  similar  tests  in  terms  of  precision,  but  the  three  methods 
produced very different tests in terms of efficiency. Today, with
the increasing use of computers in test construction, perhaps the
more meaningful question to be asked by psychometricians is: For
which ability group is the test superior? Only test information curves
and  measures  of  relative  efficiency  can  answer  that  question. 

Implications  for  Future  Research 

The  results  of  this  empirical  study  revealed  that  tests  developed 
using  classical  test  theory,  in  spite  of  its  inherent  weaknesses,  were 
no  different  with  respect  to  precision  of  measurement  than  tests 
developed  using  one  of  the  latent  trait  models,  the  one-parameter 
Rasch  model.   Comparisons  of  relative  efficiency  for  the  30  item  tests 
showed  that  the  tests  based  on  classical  test  theory  were  superior  to 
the  Rasch  developed  test  for  very  low  and  very  high  scoring  examinees, 
and the Rasch developed test was more efficient for average to high
scoring  examinees  on  the  verbal  aptitude  college  admissions  subtest 
used  in  this  study. 

Only one of the four latent trait models, the Rasch model,
was used in this comparative study of test development techniques.
Perhaps it was the very nature of this simple model that resulted in the
development of equivalent tests when compared to the tests developed
by classical item analysis and factor analysis in terms of precision.
It may be that the more technical two- and three-parameter logistic
models would have produced tests comparable or superior to those
developed by classical test theory in terms of precision of measurement
and overall relative efficiency.

The  two-parameter  model  allows  for  varying  item  discriminations  so 
that  the  initial  selection  of  items  would  not  have  to  have  been  restricted 
to  a  prespecified  range.   The  three-parameter  model  not  only  allows 
for  varying  item  discriminations,  but  also  for  the  effects  of  guessing 
on  the  test.   It  is  reasonable  to  suspect  that  guessing  may  have  been 
a factor in the item scores for the type of cognitive test used in the
present  study.   Thus,  before  the  findings  of  the  study  can  be  generally 
accepted,  replication  is  needed  using  not  only  other  populations  and 
other  instruments,  but  also  other  latent  trait  models. 
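
The relation among the three logistic models can be written compactly. The expression below is the standard three-parameter logistic form, stated here as background and not quoted from this study; the symbols a_i (discrimination), b_i (difficulty), c_i (lower asymptote attributable to guessing), and theta (ability) are conventional notation introduced for illustration. Some presentations also include a scaling constant in the exponent, which is omitted here.

```latex
P_i(\theta) = c_i + (1 - c_i)\,
  \frac{\exp\{a_i(\theta - b_i)\}}{1 + \exp\{a_i(\theta - b_i)\}}
% Two-parameter model: set c_i = 0 for every item.
% One-parameter (Rasch) model: set c_i = 0 and a_i equal to a common
% constant, so items differ only in difficulty b_i.
```

Written this way, the restriction of biserial correlations to a prespecified range corresponds to forcing the a_i toward a common value, and any guessing on the test corresponds to a nonzero c_i that the one-parameter model cannot represent.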

If other latent trait models are to be considered in addition to
the Rasch model, several points need to be evaluated. The Rasch model
is the only latent trait model that provides for the direct calibration
of items and abilities based on unweighted "number right" scoring
(Wright, 1977). The two- and three-parameter models require a more
complex scoring system where the item response is weighted in order to
estimate the discrimination and guessing parameters. The weighting
is an iterative process that may never converge or stabilize unless
arbitrary boundaries are established (Wright, 1977, p. 104). Because
of this complex scoring system, the two- and three-parameter logistic
models are less efficient for parameter estimation than the
one-parameter model in terms of computer time.

A further criticism of the two- and three-parameter logistic models
has been offered by Wright (1977) concerning the additional item
parameters. If item discrimination parameters and item guessing
parameters are introduced into a theory of measurement, why not
person parameters for sensitivity to difficult items, and inclination
toward guessing? Wright questions whether psychometricians really
need to make measurement theory so complex.

A  final  point  to  consider  when  comparing  the  latent  trait  models 
is  that  only  the  one-parameter  Rasch  model  provides  a  ratio  scale 
of  measurement  in  terms  of  the  calibrated  item  and  ability  scores 
(Hambleton et al., 1977). The odds of success on a particular item are
given by the product of the person's ability and the item's easiness.
Thus, the person with no ability will have zero odds or probability of
success on any item. The same logic applies to items with no easiness
(zero difficulty): they cannot be solved. Thus, measurements made with Rasch
calibrated  items  are  on  a  ratio  scale;  and  it  is  the  ratio  scale  of 
measurement  that  leads  to  the  concept  of  specific  objectivity. 
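
The multiplicative statement above can be expressed in symbols. The notation below (A_v for the ability of person v, E_i for the easiness of item i, and the logit-scale quantities theta_v and b_i) is conventional and introduced here for illustration; it is a sketch of the standard Rasch formulation and is not quoted from the sources cited.

```latex
\frac{P_{vi}}{1 - P_{vi}} = A_v E_i,
\qquad
P_{vi} = \frac{A_v E_i}{1 + A_v E_i},
\qquad
A_v = e^{\theta_v}, \quad E_i = e^{-b_i}
% If A_v = 0 or E_i = 0, the odds and hence P_{vi} are zero.
% The odds for two persons v and w on the same item have ratio
% A_v / A_w, which does not depend on the item.
```

The last comment is the sense in which the scale supports ratio comparisons: the comparison of two persons is the same whichever item is used to make it.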

Therefore, it seems that each latent trait model has its own
set of advantages and disadvantages, and these should be considered if
comparisons are to be made to the classical models for test development
purposes.

Additional  areas  for  future  research  may  lead  to  actual  comparisons 
of  the  content  of  the  items  selected  by  each  of  the  three  item  analytic 
techniques. Davis (1951) and Cox (1965) have criticized the selection
of items solely on statistical criteria. The use of statistical criteria
alone may result in changing the nature of the trait being measured
by deleting the items essential to adequate content coverage. Whitely
and Dawis (1976) found this to be true in a study of verbal analogy
items where the type of relationship and content of specific analogy
items proved to be quite significant when studied in isolation.

A restricting factor in this study was that a prespecified
number of test items was selected, e.g., 15 and 30, by each item
analytic procedure. For the Rasch procedure, perhaps some number of
items less than 30 but greater than 15 may have provided a better fit
to the model. This procedure might have produced mean square fit
statistics that were equivalent across the sample sizes rather than
increasing with sample size as found in this study. Thus, selecting
the number of items precisely fitting the Rasch model and comparing
that number of items with classical item analysis and factor analysis
might have led to different conclusions than those made in the present
study with regard to the precision and relative efficiency of measurement.


CHAPTER VI

SUMMARY

The quality of the items in a test determines its validity and
reliability. Through the application of item analysis procedures, test
constructors are able to obtain quantitative, objective information useful
in  developing  and  judging  the  quality  of  a  test  and  its  items. 

Classical test theory forms the basis for one method of test
development.   An  integral  part  of  the  development  of  tests  based  on  the 
classical  model  is  the  utilization  of  classical  item  analysis  or  factor 
analysis.   Classical  item  analysis  is  a  procedure  to  obtain  a  description 
of  the  statistical  characteristics  of  each  item  in  the  test.   This 
approach requires identification of single items which provide maximum
discrimination  between  individuals  on  the  latent  trait  being  measured. 
Theoretically,  selecting  items  which  have  a  high  correlation  with  total 
test  score  will  result  in  a  discriminating  test  which  is  homogeneous 
with  respect  to  the  latent  trait.   Therefore,  classical  item  analysis 
is  an  aid  to  developing  internally  consistent  tests. 

An alternative method of test development, but based on the
classical model, is factor analysis. Factor analysis is a more complex
test development procedure than classical item analysis. It is a
statistical technique that takes into account the item correlation with
all other individual items in the test simultaneously. Groups of
similar items tend to cluster together and comprise the latent traits
(factors) underlying the test. Thus, under the classical model,
classical item analysis can be viewed as a unidimensional basis for
item analysis, less sophisticated than the multidimensional procedure
of factor analysis.

Classical  item  analysis  and  factor  analysis  have  long  been  the  only 
techniques  described  in  measurement  texts  for  use  in  test  development 
(Baker,  1977).   However,  with  the  publication  of  Lord  and  Novick's 
Statistical Theories of Mental Test Scores (1968), considerable attention
is being directed now toward the field of latent trait theory as a new
area in test development. Proponents of this approach claim that the
advantages of latent trait theory over classical test theory are
twofold: (a) theoretically it provides item parameters that are
invariant across examinee samples which will differ with respect to the
latent  trait,  and  (b)  it  provides  item  characteristic  curves  that  give 
insight  into  how  specific  items  discriminate  between  students  of 
varying  abilities. 

Four  latent  trait  models  have  been  developed  for  use  with 
dichotomously scored data: the normal ogive and the one-, two-, and
three-parameter logistic models (Hambleton and Cook, 1977; Lord and
Novick,  1968).   This  study  was  concerned  with  the  one-parameter  logistic 
Rasch  model  because  it  is  the  simplest  of  the  four  models. 

A review of the literature revealed numerous studies conducted
in each of the three areas of item analysis, but relatively few
comparative studies were reported between the three methods. Missing
from the review were comparative studies among all three item analytic
techniques. Therefore, the present study was designed to compare the
methods of classical item analysis, factor analysis, and the Rasch model
on measures of precision and relative efficiency in test development.

An  empirical  study  was  designed  to  compare  the  effects  of  the  three 
methods  of  item  analysis  on  test  development  across  different  sample 
sizes.   Item  response  data  were  obtained  from  a  sample  of  5,235  high 
school  seniors  on  a  cognitive  test  of  verbal  aptitude. 

The  subjects  were  divided  into  9  samples:   three  independent 
groups  of  250  subjects  each,  three  independent  groups  of  500  subjects 
each, and three independent groups of 995 subjects each. The independent
groups  were  obtained  so  that  tests  of  statistical  significance  could 
be  performed. 
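
The division of the subject pool into nine non-overlapping samples can be sketched as follows. The group sizes come from the description above; random assignment, the seed, the function name, and the use of integer identifiers are assumptions made for illustration, since the text does not say how subjects were allocated.

```python
import random


def split_into_samples(examinee_ids, sizes, seed=0):
    """Partition examinees into independent samples of the given sizes.

    Assumption: assignment is random. Each examinee appears in at most
    one sample, which is what makes the samples independent.
    """
    ids = list(examinee_ids)
    if sum(sizes) > len(ids):
        raise ValueError("not enough examinees for the requested sizes")

    rng = random.Random(seed)
    rng.shuffle(ids)

    samples = []
    start = 0
    for size in sizes:
        samples.append(ids[start:start + size])
        start += size
    return samples


# Three groups each of 250, 500, and 995 subjects.
sizes = [250] * 3 + [500] * 3 + [995] * 3
samples = split_into_samples(range(5235), sizes)
print([len(s) for s in samples])
```

The nine group sizes sum to 5,235, so under this scheme every subject in the pool is used exactly once.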

The  item  response  data  were  then  analyzed  in  three  phases.   First, 
the  "best"  15  and  30  items  were  selected  using  each  item  analytic 
technique.   Under  classical  item  analysis,  the  best  15  and  30  items 
were  selected  based  on  the  highest  biserial  correlations.   For  factor 
analysis,  the  best  15  and  30  items  were  selected  based  on  the  highest 
item  loadings  on  the  first  (unrotated)  principal  component.   The 
selections of the best 15 and 30 items using the Rasch model were
based upon the mean square fit of the items to the model. These
procedures were used for each group of subjects. Second, a double
cross-validation design was employed to obtain estimates on the item and
test parameters for the best 15 and 30 items. The items selected from
the three item analytic techniques were scored for different samples
of subjects by randomly reassigning the samples which had been used
in the original item analysis.
Then, the best 15 and 30 items chosen by each method were submitted
to  a  common  item  analytic  procedure  in  order  to  obtain  estimates  for 
comparing  the  three  item  analytic  methods.   Third,  a  two-way  analysis 
of  variance  and  a  Tukey  HSD  post  hoc  comparison  test,  when  indicated, 
were  used  to  test  for  differences  in  the  properties  of  items  selected 
by  each  item  analytic  procedure.   Also  confidence  intervals  were 
calculated  to  compare  the  internal  consistency  estimates  to  a  population 
value.   In  addition,  the  relative  efficiencies  of  the  30  item  tests 
developed  by  each  item  analytic  technique  were  compared  for  the  sample 
of  995  subjects. 
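
The two classical selection rules described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the procedure actually run in the study: the response matrix is simulated, all function and variable names are hypothetical, and the point-biserial item-total correlation is substituted for the biserial correlation because it can be computed directly with NumPy. Selection by Rasch mean square fit is not shown.

```python
import numpy as np

def item_total_correlations(responses):
    """Point-biserial correlation of each 0/1 item with the total score.
    (Stand-in for the biserial correlation named in the text.)"""
    total = responses.sum(axis=1)
    n_items = responses.shape[1]
    return np.array([np.corrcoef(responses[:, j], total)[0, 1]
                     for j in range(n_items)])

def first_component_loadings(responses):
    """Item loadings on the first unrotated principal component of the
    inter-item correlation matrix."""
    corr = np.corrcoef(responses, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
    vector = eigvecs[:, -1]
    if vector.sum() < 0:                      # eigenvector sign is arbitrary
        vector = -vector
    return vector * np.sqrt(eigvals[-1])

def select_best(index_values, n_items):
    """Indices of the n_items largest index values, best first."""
    return np.argsort(index_values)[::-1][:n_items]

# Simulated (hypothetical) data: 500 examinees, 40 dichotomous items.
rng = np.random.default_rng(0)
n_persons, n_pool = 500, 40
ability = rng.normal(size=n_persons)
difficulty = rng.uniform(-1.5, 1.5, size=n_pool)
slope = rng.uniform(0.3, 1.5, size=n_pool)    # items differ in discrimination
p_correct = 1.0 / (1.0 + np.exp(-slope * (ability[:, None] - difficulty[None, :])))
responses = (rng.random((n_persons, n_pool)) < p_correct).astype(int)

r_it = item_total_correlations(responses)
loadings = first_component_loadings(responses)

best_15_classical = select_best(r_it, 15)
best_15_factor = select_best(loadings, 15)
overlap = len(set(best_15_classical.tolist()) & set(best_15_factor.tolist()))
```

Here `overlap` counts how many items the two rules have in common for this simulated pool; repeating the selection on independent samples would mirror the cross-validation step described above.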

The  results  of  the  analysis  showed  that  there  were  no  apparent 
differences  in  the  types  of  tests  produced  by  the  three  methods  of  item 
analysis  in  terms  of  the  precision  of  measurement.   The  three  methods  were 
compared  on  measures  of  internal  consistency,  the  standard  error  of 
measurement,  mean  item  difficulty,  and  mean  item  discrimination. 

Confidence intervals were generated around the observed internal
consistency estimates for the 15 and 30 item tests produced by each
method. The confidence intervals were obtained to compare the observed
internal consistency estimates from each test to the projected population
internal consistency for tests of a similar length, as suggested in
Feldt (1965). The projected population value (obtained via the Spearman-
Brown Prophecy Formula) represented what the test's internal consistency
would have been for a test created by deleting items at random. Of the
18 confidence intervals calculated at a 95 percent level of confidence,
15 did not contain the projected population value. Therefore, it was
concluded that the three item analytic techniques were significantly
different from random item deletion in producing tests with higher
internal consistency estimates. It was also noted that as the
number of examinees decreased, so did the internal consistency
estimates.
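
As a worked illustration of the projection step, the Spearman-Brown Prophecy Formula can be written as a one-line function. The sketch below is hedged: the full-test length (60 items) and full-test internal consistency (.90) are hypothetical values chosen only for illustration and are not taken from the study, and the function name is an assumption.

```python
def spearman_brown(reliability, length_ratio):
    """Projected reliability when test length is multiplied by length_ratio.

    length_ratio < 1 corresponds to shortening the test, as when items
    are deleted at random.
    """
    return (length_ratio * reliability) / (1.0 + (length_ratio - 1.0) * reliability)

# Hypothetical full test: 60 items with internal consistency .90.
full_length = 60
full_reliability = 0.90

projected_30 = spearman_brown(full_reliability, 30 / full_length)
projected_15 = spearman_brown(full_reliability, 15 / full_length)
```

Under these assumed values the projection is about .82 for a 30 item test and about .69 for a 15 item test; an observed estimate whose confidence interval lies above the projected value indicates better-than-random item selection.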

The standard error of measurement was consistently smaller for
both the 15 and 30 item tests produced by classical test theory when
compared to the 15 and 30 item tests based on the Rasch model. However,
the differences in the standard errors of measurement did not equal or
exceed 1.00 for any of the methods.
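
For readers who want the quantity in concrete terms, the classical standard error of measurement is the raw-score standard deviation multiplied by the square root of one minus the reliability. The following is a minimal sketch; the standard deviation and reliability values are hypothetical and the function name is an assumption.

```python
import math

def standard_error_of_measurement(score_sd, reliability):
    """Classical SEM: score_sd * sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1.0 - reliability)

# Hypothetical 30 item test: raw-score SD of 5.0, internal consistency .84.
sem_example = standard_error_of_measurement(5.0, 0.84)
```

With these assumed values the standard error of measurement is 2.0 raw-score points, so a difference of less than 1.00 between methods is small relative to the error band itself.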

No significant F ratios were observed for either the 15 or 30 item
tests on item difficulty. Therefore, each item analytic technique tended
to select items which had similar item difficulties on the average.

For the variable item discrimination on the 15 item tests, a
significant F ratio (p < .05) was observed for the main effect of the
item analytic technique. Tukey's HSD post hoc analysis indicated that
the Rasch test tended to contain items with lower biserial correlations,
on the average, than tests produced by factor analysis and classical
item analysis. This finding was probably due to the fact that the range
of the biserial correlations was restricted to .39 - .79 for items
retained in the Rasch analysis. However, when the test length was
increased to 30 items, no significant F ratios were observed for the
variable of item discrimination.

In terms of test efficiency, the results indicated substantive
differences in the tests produced by the three methods of item analysis.
The 30 item test for the sample of 995 examinees was used in this
comparison. It was found that the Rasch developed test was superior to
the two tests based on classical test theory for students of average to
high ability. The tests based on classical test theory, however, were
superior in efficiency to the Rasch test for students of very low
or very high ability. In light of these findings, it was suggested
that measures of test efficiency ought to be incorporated into the
test development procedures, as they provide much more detailed
information on how the test discriminates for various ability groups
than does a single overall estimate of a test's homogeneity.


APPENDIX A
MATHEMATICAL DERIVATION OF THE RASCH MODEL


Summarized  from  Ryan  (1977) 



If ability is equal to $\theta$ and item difficulty is equal to $\zeta$,
then the odds of correctly solving an item are given by

$$\text{odds} = \frac{\theta}{\zeta}. \tag{1}$$

When $\theta > \zeta$, the odds favor a correct response; when
$\theta < \zeta$, the odds favor an incorrect response; and when
$\theta = \zeta$, the person has a 50-50 chance of success on the item.

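The probability of a correct response can readily be derived from
the statement of the odds. In general,

$$\text{Probability} = \frac{\text{odds}}{1 + \text{odds}}. \tag{2}$$

Substituting (1) into (2) gives

$$P = \frac{\theta/\zeta}{1 + \theta/\zeta}, \tag{3}$$

which is equivalent to

$$P = \frac{\theta/\zeta}{\zeta/\zeta + \theta/\zeta} = \frac{\theta/\zeta}{(\zeta + \theta)/\zeta} = \frac{\theta}{\zeta + \theta}. \tag{4}$$

Formula 4 is the probability of a correct response.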
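
The step from odds to probability can be checked numerically. The sketch below is illustrative only; it writes ability as `theta` and item difficulty as `zeta`, and the function names and numeric values are hypothetical.

```python
def odds(theta, zeta):
    """Equation 1: odds of a correct response."""
    return theta / zeta

def prob_correct(theta, zeta):
    """Equations 2 and 3: probability = odds / (1 + odds)."""
    o = odds(theta, zeta)
    return o / (1.0 + o)

# Hypothetical values: ability twice the item difficulty.
theta, zeta = 2.0, 1.0
p_from_odds = prob_correct(theta, zeta)
p_closed_form = theta / (zeta + theta)      # equation 4

# Equal ability and difficulty give a 50-50 chance.
p_equal = prob_correct(1.5, 1.5)
```

Both routes give the same value (2/3 for these numbers), which is the algebraic identity stated in formula 4.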

The probability of an incorrect response is represented by Q, which is
1 - P. Thus,

$$Q = (1 - P) = 1 - \frac{\theta}{\zeta + \theta}, \tag{5}$$

which is equivalent to

$$Q = \frac{\zeta + \theta}{\zeta + \theta} - \frac{\theta}{\zeta + \theta} = \frac{\zeta}{\zeta + \theta}. \tag{6}$$

Formula 6 is the probability of an incorrect response.

The relationship of these probability statements to the statement
of the odds is given by

$$\frac{P}{Q} = \frac{\theta/(\zeta + \theta)}{\zeta/(\zeta + \theta)} = \frac{\theta}{\zeta}. \tag{7}$$

The right side of equation 7 is simply the odds as defined in equation 1.
The left side of the equation is the probability of a correct response
divided by the probability of an incorrect response. The probability of
a correct response, P, is estimated by the proportion of examinees in
a sample who correctly answer an item. On a test of k items, the
probability of a person with a score of j (where $1 \le j \le k$) correctly
answering a particular item is simply the proportion of people with
the raw score j who correctly answered the item. This is nothing more
than the item difficulty for all the people who have a raw score of j.
The probability can be calculated in terms of the item difficulty for
all raw score groups from 1 to (k-1) and across all items.

The probability of an incorrect response, Q, is the proportion of
people answering an item incorrectly. This is simply one minus the
proportion answering it correctly (1-P). Both P and Q are easily
calculated from a set of data; hence the value of P/Q in formula 7 is
an easily derived statistic which estimates the odds.

Separating the Parameters

Consider again equation 7 and take the natural log (ln) of both
sides of the equation. This gives

$$\ln\left(\frac{P}{Q}\right) = \ln\left(\frac{\theta}{\zeta}\right). \tag{8}$$

Since the ln of a ratio is the same as the difference of the ln's,
equation 8 becomes

$$\ln\left(\frac{P}{Q}\right) = \ln\theta - \ln\zeta. \tag{9}$$

Person  Free  Item  Difficulties 

Next consider a score group with ability $\theta_1$ and two test items
with difficulties $\zeta_1$ and $\zeta_2$, respectively. The probability
of a person with ability $\theta_1$ correctly answering an item of
difficulty $\zeta_1$ is $P_{11}$. The probability of an incorrect
response is $Q_{11}$. The probability of a person with ability $\theta_1$
correctly answering an item of difficulty $\zeta_2$ is $P_{12}$, and the
probability of an incorrect response is $Q_{12}$. By equation 9,

$$\ln\left(\frac{P_{11}}{Q_{11}}\right) = \ln\theta_1 - \ln\zeta_1, \text{ and} \tag{10}$$

$$\ln\left(\frac{P_{12}}{Q_{12}}\right) = \ln\theta_1 - \ln\zeta_2. \tag{11}$$

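If equation 11 is subtracted from equation 10, term by term, the result
is

$$\ln\left(\frac{P_{11}}{Q_{11}}\right) - \ln\left(\frac{P_{12}}{Q_{12}}\right) = (\ln\theta_1 - \ln\zeta_1) - (\ln\theta_1 - \ln\zeta_2). \tag{12}$$

This is the same as

$$\ln\left(\frac{P_{11}}{Q_{11}}\right) - \ln\left(\frac{P_{12}}{Q_{12}}\right) = \ln\theta_1 - \ln\zeta_1 - \ln\theta_1 + \ln\zeta_2, \tag{13}$$

or

$$\ln\left(\frac{P_{11}}{Q_{11}}\right) - \ln\left(\frac{P_{12}}{Q_{12}}\right) = \ln\zeta_2 - \ln\zeta_1. \tag{14}$$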

Equation 14 should be examined very carefully. On the left side of
the equation is an easily calculated statistic: the difference
between the ln odds of a correct response on item 1 compared to item 2.
The right side is significant for what it does not contain. There is
no parameter for the person's ability on the right side of the equation.
This same result occurs regardless of the ability of the person or
group examined. The difference between the difficulty of item 1 and
the difficulty of item 2 can be calculated independently of the
subject or group of subjects involved. In general, for any two items
of difficulty $\zeta_l$ and $\zeta_m$, the difference between the
difficulty of the two items is given by

$$\ln\zeta_m - \ln\zeta_l = \ln\left(\frac{P_{il}}{Q_{il}}\right) - \ln\left(\frac{P_{im}}{Q_{im}}\right). \tag{15}$$
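
The person-free property in equation 15 can be illustrated numerically: the difference in ln odds between two items should come out the same for every ability level. This is a minimal sketch under assumed values; the difficulties, abilities, and function names are hypothetical.

```python
import math

def prob_correct(theta, zeta):
    """Equation 4: probability of a correct response."""
    return theta / (zeta + theta)

def ln_odds(theta, zeta):
    """ln(P/Q) for a person of ability theta on an item of difficulty zeta."""
    p = prob_correct(theta, zeta)
    q = 1.0 - p
    return math.log(p / q)

# Two hypothetical items and three hypothetical ability levels.
zeta_l, zeta_m = 0.5, 2.0
abilities = [0.2, 1.0, 5.0]

# Left side of equation 15, evaluated separately for each ability level.
differences = [ln_odds(theta, zeta_l) - ln_odds(theta, zeta_m)
               for theta in abilities]

# Right side of equation 15: contains no ability parameter.
expected = math.log(zeta_m) - math.log(zeta_l)
```

Every entry of `differences` equals `expected` (ln 4 for these values) up to floating-point error, whichever ability level is used.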

Item  Free  Person  Abilities 

The discussion of person ability estimates is an exact parallel
to the discussion of item difficulty estimates. Instead of comparing
two items across any group of examinees, the discussion of ability
proceeds by comparing any two groups on any test item. Consider score
group 1 with ability $\theta_1$, score group 2 with ability $\theta_2$,
and item 1 with difficulty $\zeta_1$. From equation 9,

$$\ln\left(\frac{P_{11}}{Q_{11}}\right) = \ln\theta_1 - \ln\zeta_1, \text{ and} \tag{16}$$

$$\ln\left(\frac{P_{21}}{Q_{21}}\right) = \ln\theta_2 - \ln\zeta_1. \tag{17}$$

Subtracting equation 17 from equation 16 will yield

$$\ln\left(\frac{P_{11}}{Q_{11}}\right) - \ln\left(\frac{P_{21}}{Q_{21}}\right) = \ln\theta_1 - \ln\theta_2. \tag{18}$$

The difference between the abilities of the examinees in score group 1
and score group 2 (the right side of equation 18) is described without
reference to the item involved. In general, for any two groups with
abilities $\theta_i$ and $\theta_j$,

$$\ln\theta_i - \ln\theta_j = \ln\left(\frac{P_{il}}{Q_{il}}\right) - \ln\left(\frac{P_{jl}}{Q_{jl}}\right) \tag{19}$$

for any item with difficulty $\zeta_l$. In this case the abilities are being
compared independently of the difficulty of the item used to compare
them. This is often referred to as item free person ability estimation.

Formalizing the Model

To describe the Rasch model, let $\ln\theta$ and $\ln\zeta$ be re-defined.
Specifically, let

$$\beta = \ln\theta, \text{ and} \tag{20}$$

$$\delta = \ln\zeta. \tag{21}$$

Equations 20 and 21 simply define the ln ability as $\beta$ and the ln
difficulty as $\delta$.

If the base of the natural log system, e, is raised to the power of each
side of equations 20 and 21, we get

$$e^{\beta} = e^{\ln\theta} = \theta, \text{ and} \tag{22}$$

$$e^{\delta} = e^{\ln\zeta} = \zeta. \tag{23}$$

Recall equation 3,

$$P = \frac{\theta/\zeta}{1 + \theta/\zeta}, \tag{24}$$

and substitute the equivalent terms for $\theta$ and $\zeta$ as defined in
equations 22 and 23. This gives

$$P = \frac{e^{\beta}/e^{\delta}}{1 + e^{\beta}/e^{\delta}}, \tag{25}$$

or

$$P = \frac{e^{(\beta - \delta)}}{1 + e^{(\beta - \delta)}}. \tag{26}$$

More formally this is

$$P(X_{vi} = 1 \mid \beta_v, \delta_i) = \frac{e^{(\beta_v - \delta_i)}}{1 + e^{(\beta_v - \delta_i)}}. \tag{27}$$

Equation 27 is the Rasch model.
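
The final form of the model can be exercised with a short sketch. It is illustrative only: the parameter values and function names are hypothetical. It checks that the logistic form agrees with the ratio form of formula 4, and that the ln odds difference between two persons is the same on every item, as in equation 19.

```python
import math

def rasch_probability(beta, delta):
    """Equation 27: probability of a correct response for a person with
    ln ability beta on an item with ln difficulty delta."""
    return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))

def ln_odds(p):
    """ln(P/Q) for a probability p."""
    return math.log(p / (1.0 - p))

# Hypothetical ln abilities for two persons and ln difficulties for three items.
beta_v, beta_w = 1.2, -0.4
deltas = [-1.0, 0.0, 2.5]

# The logistic form agrees with theta / (zeta + theta) from formula 4.
theta, zeta = math.exp(beta_v), math.exp(deltas[0])
p_ratio_form = theta / (zeta + theta)
p_logistic_form = rasch_probability(beta_v, deltas[0])

# The ln odds gap between the two persons, computed on each item in turn.
ability_gaps = [ln_odds(rasch_probability(beta_v, d)) -
                ln_odds(rasch_probability(beta_w, d))
                for d in deltas]
```

Each entry of `ability_gaps` equals `beta_v - beta_w` (1.6 for these values) regardless of the item used, and a person whose ln ability equals the ln difficulty of an item has a probability of .5 of answering it correctly.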


APPENDIX  B 

RELATIVE  EFFICIENCY  VALUES  USED  IN  FIGURE  2  FOR
THE  COMPARISONS  AMONG  ITEM  ANALYTIC  METHODS 


TABLE B.1

RELATIVE EFFICIENCY VALUES USED IN FIGURE 2 FOR
THE COMPARISONS AMONG ITEM ANALYTIC METHODS

Percentile    Factor Analysis    Rasch to     Rasch to
Rank          to Classical       Classical    Factor Analysis

     1             4.00             .49            .12
    10             1.62             .55            .34
    20              .72             .49            .63
    30              .95             .84            .88
    40             1.25            1.33           1.06
    50              .80            1.00           1.25
    60             1.59            1.30            .94
    70             1.17            1.68           1.43
    80              .91             .78            .86
    90              .67            1.69           2.54
    98              .98             .29            .29